How many rocks in Scotland?

Simon Knight · Post by **Simon Knight** » Mon Oct 05, 2020 12:47 pm

Yes, it's another speed challenge!

I have an idea that I want to add some sailing hazards to my Garmin GPS as "Points of Interest". These can be handcrafted but, being lazy, I've been seeking some data online. I found the OpenStreetMap site and discovered that they have a subset named OpenSeaMap. The map data is freely available in several formats from various web sites, and I downloaded the data for Scotland because I enjoy getting wet and bitten by midges.

If you are interested, the data may be downloaded from this web page http://download.geofabrik.de/europe/gre ... tland.html by following the link scotland-latest.osm.bz2. The bz2 file is a bzip2-compressed archive (not a zip file) of 320 Mbytes, which expands to 3.2 Gbytes of XML.

My stack parses the data and extracts xml records named/tagged as nodes to a new file which is saved in the same folder as the original xml file. On my mac the stack takes 27.5 minutes to run and outputs 149 records that describe rocks. I have no idea if this is correct but it seems a little on the low side.

I have avoided the Livecode XML tools as Mark Waddingham has briefed that they are not very fast on large files of several megabytes. My code is very much a first draft and I'm sure it can be improved upon.

The main parsing routine is here for those who don't want to download the stack file:

Code: Select all

On ParseFile pFile
   --   Constant kStartChar = "<"
   --   Constant kEndChar = ">"
   --   Constant kNodeEnd = "</node>"
   
   Put the milliseconds into tStartTime
   put "Started at : " & tStartTime & "ms" & cr into field "debug"
   
   ## define two vars that are used to filter the records output to file
   put ("<tag k=" & quote & "seamark:") into tTag
   put  quote & "rock" & quote into tTag2
   
   ## Create name for output file
   set itemdel to slash
   put pFile into tFileSpec
   delete the last item of tFileSpec
   put slash & "OSM-Scotland-Extract-Seamarks-Rocks.xml" after tFileSpec
   
   ## Open the output file for appending
   open file tFileSpec for append
   
   ## Open the inputfile for reading
   Open file pFile for text read
   
   ## position file pointer to beginning of the data i.e. after the header cruft
   read from file pFile until "<node"  -- first node in file
   put the length of it into tcharPosn
   put tcharPosn-5 into tcharPosn -- move the file pointer back to beginning of node data
   read from file pFile at 1 for tCharPosn
   put it into tData  -- for debug
   
   ## Setup some variables
   put "Running" into tStatus
   put 0 into tCounter
   --put true into tEndNotFound
   --put true into tSeekStartChar
   
   ## Loop through the file
   repeat until tStatus = "eof" --OR tCounter = 11
      
      read from file pFile until "<node" 
      put the Result into tStatus
      put it into tData -- for debug purposes
      
      read from file pFile until ">" 
      put the Result into tStatus
      put it into tData 
      
      --put word 1 of tData into tTagName
      --if tTagName is not "node" then next repeat
      
      if char -2 of tData is "/" then 
         next repeat
      end if
      
      ## it is a node and it is not an isolated node with no tags
      put tData into tRecord  -- store first portion of record
      ## Read rest of record until the node close
      read from file pFile until "</node>" 
      put the Result into tStatus
      put it into tData 
      put it after tRecord
      
      ## output the record to the output file if it is a rock
      put "<node " before tRecord
      
      if tRecord contains tTag AND trecord contains tTag2 then
         add one to tCounter
         write tRecord & cr to file tFileSpec
      end if
      
   end repeat
   close file pFile
   close file tFileSpec
   
   Put the milliseconds into tEndTime
   put "Ended at : " & tEndTime & "ms" & cr after field "debug"
   
   Put tEndTime - tStartTime into tElapsedTime
   put tElapsedTime/60000 into tElapsedTime
   put "Run Time : " & tElapsedTime & " mins" & cr after field "debug"
   put tCounter & " records created in output file" & cr after field "debug"
   
end ParseFile

best wishes and stay safe,

Simon

Thierry · Post by **Thierry** » Tue Oct 06, 2020 4:58 pm

Simon Knight wrote: ↑
Mon Oct 05, 2020 12:47 pm
On my mac the stack takes 27.5 minutes to run

Hi Simon,

A very quick answer...

Running your stack without any changes:

33 minutes on my mini-mac2 and find 149 records too.

Now, rethinking the logic and rewriting from scratch:

18 seconds and still find 149 records.

... and outputs 149 records that describe rocks.
I have no idea if this is correct but it seems a little on the low side.

I didn't check and don't know for sure
if it is the expected number of records.

Will get in touch with you later with all the details... I have to go...

Regards,

Thierry

Simon Knight · Post by **Simon Knight** » Tue Oct 06, 2020 6:00 pm

EIGHTEEN SECONDS - get out of here!!!!!!!!

I can't wait to learn how you have achieved such a massive jump in speed.

I'll obviously have to give up programming

best wishes

S

jacque · Post by **jacque** » Wed Oct 07, 2020 5:16 pm

Knowing Thierry, it involves regex.

SparkOut · Post by **SparkOut** » Wed Oct 07, 2020 5:50 pm

Knowing Thierry, it will be an extremely good solution, and probably involve regex.

bn · Post by bn » Wed Oct 07, 2020 9:14 pm

Here is my take on this.

On my machine 2017 MacBookPro with SSD it takes roughly 60 seconds.

That is more than 18 seconds for Thierry's solution but still less than the original solution.

It finds 148 occurrences in the 3.55 GB file Simon indicated. That is 1 less than Simon and Thierry find.
I could not find an explanation for it.

However it has 5 hits that contain

<tag k="seamark:type" v="rock"/>

but they seem to belong to other map information. (They will be shown in the message box.)

Code: Select all

on mouseUp pMouseButton
   answer file "choose"
   if it is empty then exit mouseUp
   put empty into field "Debug"
   put empty into msg
   put the milliseconds into t
   
   put it into tFile
   put 0 into tSum
   
   open file tFile for binary read 
   
   put "<tag k=" & quote & "seamark:type" & quote & space  & "v=" & quote & "rock" & quote & "/>" into tTag
   put "<node" into tBeginTag
   put "</node>" into tEndTag
   
   put empty into tCollect
   
   put 6 into tHeadLines -- lines to search above hit
   put 4 into tTrailLines -- lines to search below hit
   
   repeat 40
      read from file tFile for 100000000
      if the result is "EOF"  then
         exit repeat
      end if
      
      put it into tData
      
      read from file tFile until tEndTag
      put it after tData
      
      put 0 into tSkip
      
      repeat
         put lineOffset(tTag, tData, tSkip) into tOffsetLine
         if tOffsetLine > 0 then
            add tOffsetLine to tSkip
            put line (tSkip - tHeadLines) to (tSkip + tTrailLines) of tData into tHit
            
            put tHeadLines into tSubOffset
            
            -- the last 5 entries seem to be not part of the localisation data
            -- they will be shown in the message box
            if not (tHit contains tEndTag) then
               put tHit & cr & cr after msg
               next repeat
            end if
            
            repeat with i = tSubOffset + 1 to the number of lines of tHit
               if word 1 of line i of tHit begins with tEndTag then
                  delete line i + 1 to -1 of tHit
                  exit repeat
               end if
            end repeat
            
            repeat with i = tSubOffset down to 1
               if word 1 of line i of tHit begins with tBeginTag then
                  delete line 1 to (i -1) of tHit
                  exit repeat
               end if
            end repeat
            
            put tHit & cr & "-----" & cr after tCollect
            add 1 to tSum
            
         else -- tOffsetLine is 0: no further hit in this chunk
            exit repeat
         end if
      end repeat
      
   end repeat
   close file tFile 
   put the milliseconds - t & "ms" & " found " & tSum & " occurrences" & cr & cr & tCollect  into field "Debug"
end mouseUp

This goes into a button in Simon's stack.

(LC does not really like working with HUGE data like this)

Kind regards
Bernd

bn · Post by bn » Wed Oct 07, 2020 11:46 pm

Got it down to about 40 seconds by playing with the amount read from the big file in each iteration.
This is a bit tricky and probably depends on the hardware, especially hard disk vs. solid-state disk.
For SSD 10 MB for each iteration seems to be optimal.

Here is the current version

Code: Select all

on mouseUp pMouseButton
   answer file "choose"
   if it is empty then exit mouseUp
   put empty into field "Debug"
   put empty into msg
   put the milliseconds into t
   
   put it into tFile
   put 0 into tSum
   
   open file tFile for binary read 
   
   put "<tag k=" & quote & "seamark:type" & quote & space  & "v=" & quote & "rock" & quote & "/>" into tTag
   put "<node" into tBeginTag
   put "</node>" into tEndTag
   
   put empty into tCollect
   
   put 6 into tHeadLines -- lines to search above hit
   put 4 into tTrailLines -- lines to search below hit
   
   repeat
      read from file tFile for 10000000
      if the result is "EOF"  then
         exit repeat
      end if
      
      put it into tData
      
      read from file tFile until tEndTag
      put it after tData
      
      put 0 into tSkip
      
      repeat
         put lineOffset(tTag, tData, tSkip) into tOffsetLine
         if tOffsetLine > 0 then
            add tOffsetLine to tSkip
            put line (tSkip - tHeadLines) to (tSkip + tTrailLines) of tData into tHit
            
            put tHeadLines into tSubOffset
            
            -- the last 5 entries seem to be not part of the localisation data
            -- they will be shown in the message box
            if not (tHit contains tEndTag) then
               put tHit & cr & cr after msg
               next repeat
            end if
            
            repeat with i = tSubOffset + 1 to the number of lines of tHit
               if word 1 of line i of tHit begins with tEndTag then
                  delete line i + 1 to -1 of tHit
                  exit repeat
               end if
            end repeat
            
            repeat with i = tSubOffset down to 1
               if word 1 of line i of tHit begins with tBeginTag then
                  delete line 1 to (i -1) of tHit
                  exit repeat
               end if
            end repeat
            
            put tHit & cr & "-----" & cr after tCollect
            add 1 to tSum
            
         else -- tOffsetLine is 0: no further hit in this chunk
            exit repeat
         end if
      end repeat
      
   end repeat
   close file tFile 
   put the milliseconds - t & "ms" & " found " & tSum & " occurrences" & cr & cr & tCollect  into field "Debug"
end mouseUp

Kind regards
Bernd

Simon Knight · Post by **Simon Knight** » Thu Oct 08, 2020 11:40 am

Hi Bernd,

It's interesting that such a significant increase in processing speed can be achieved by modifying the code to read blocks of data into memory before searching for characters. I suppose it occurs because your code does not require as many read operations - forty versus tens of thousands. The 10 megabyte sweet spot is harder to understand and is possibly related to an ideal memory block size. In the past, computers such as the 8-bit BBC Micro had the concept of pages of memory, although I have no idea if the concept is still used in modern 64-bit systems.

I feel a trial coming on!

best wishes

Simon

AxWald · Post by **AxWald** » Thu Oct 08, 2020 1:05 pm

Hi,

did a few trials, looked closer at the XML, went into "wtf mode".
Loaded it into my text editor (EditPad pro), and a minute later had an index file.
These are, of all the lines that contain "seamark", those that also contain "rock". (Should be similar to what Bernd uses.)
Format: line number in the OSM file, tab, line content. 302 hits.

Who in a mental state allowing unattended participation in social life would create such a beast as this file? 3.3 GB, containing at least 1/3rd white noise ... Anyways. Maybe the index can be of help for someone.

I just refuse to work with such abominations: Ceterum censeo XML esse delendam ;-))

Have fun!

bn · Post by bn » Thu Oct 08, 2020 1:12 pm

Simon Knight wrote: ↑
Thu Oct 08, 2020 11:40 am
It's interesting that such a significant increase in processing speed can be achieved by modifying the code to read blocks of data into memory before searching for characters. I suppose it occurs because your code does not require as many read operations - forty versus tens of thousands. The 10 megabyte sweet spot is harder to understand and is possibly related to an ideal memory block size.

Hi Simon,

I think that the 10 megabyte sweet spot is a combination of a couple of things: disc latency, access time, memory, and the line counting that I make LC do. To reduce the counting of lines, I access the line number for each hit only 2 times: for the text above the hit and for the text below it.
I do the cleanup in the variable tHit. I think that also helps to reduce the overall time. Think counting lines in a 100 MB vs. 10 MB chunk.
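
A rough way to explore the chunk-size effect described above is to time one full pass for several read sizes. The sketch below is not Bernd's script: `timeChunkedScan` is a hypothetical helper, and it only times the reads plus one `lineOffset` search per chunk, so the absolute figures will differ from the full extraction.

```livecode
-- Hypothetical helper (not from the thread): returns the milliseconds needed
-- to read pFile in pChunkSize-byte blocks, extending each block to the next
-- "</node>" and running one lineOffset search on it.
function timeChunkedScan pFile, pChunkSize
   local tStart, tStatus, tData, tTag
   put "<tag k=" & quote & "seamark:type" & quote & space & "v=" & quote & "rock" & quote & "/>" into tTag
   put the milliseconds into tStart
   open file pFile for binary read
   repeat
      read from file pFile for pChunkSize
      put the result into tStatus   -- empty while more data remains
      put it into tData
      if tStatus is empty then
         -- finish the node that the block boundary cut in two
         read from file pFile until "</node>"
         put the result into tStatus
         put it after tData
      end if
      get lineOffset(tTag, tData)   -- forces LC to count lines up to the first hit
      if tStatus is not empty then exit repeat   -- "eof" or a read error
   end repeat
   close file pFile
   return the milliseconds - tStart
end timeChunkedScan
```

Calling it with, say, 1000000, 10000000 and 100000000 on the same file gives a first impression of where the sweet spot lies on a particular machine. Disk caching will favour later runs, so repeating each measurement is advisable.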

Kind regards
Bernd

bn · Post by bn » Thu Oct 08, 2020 6:23 pm

Hi Simon,

Interesting news:

in your version change

Code: Select all

## Open the inputfile for reading
   Open file pFile for text read

to

Code: Select all

## Open the inputfile for reading
   Open file pFile for binary read

and for me with that change the timing was

Started at : 1602175881354ms
Ended at : 1602176012382ms
Run Time : 2.1838 mins
149 records created in output file

I found the 1 record difference. I get 148 records your method gets 149 records

The one you find in addition to mine is

<node id="663979365" version="2" timestamp="2016-04-19T06:49:07Z" lat="57.0315558" lon="-5.7216846">
<tag k="name" v="Sgeirean Ghlasa"/>
<tag k="source" v="survey"/>
<tag k="disused" v="yes"/>
<tag k="natural" v="rock"/>
<tag k="man_made" v="beacon"/>
<tag k="source:name" v="npe"/>
<tag k="seamark:type" v="beacon_special_purpose"/>
<tag k="seamark:topmark:shape" v="cylinder"/>
</node>

This is because I search explicitly for

Code: Select all

<tag k="seamark:type" v="rock"/>

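whereas you search for ("<tag k=" & quote & "seamark:") and for "rock" (in quotes) anywhere in the same _record_, not in the same line. Hence the additional record: it has a seamark:type tag (with the value beacon_special_purpose) and, on a different line, a natural tag with the value "rock".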

Code: Select all

if tRecord contains tTag AND trecord contains tTag2 then
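
To make the difference between the two tests concrete, here is a minimal sketch with both written as functions. The function names are hypothetical and do not appear in either stack; the search strings are the ones quoted in this thread.

```livecode
-- Hypothetical helpers (not from the thread) contrasting the two tests.

-- Simon's test: both fragments may occur anywhere in the node record,
-- even in two different tags.
function matchesAnywhereInRecord pRecord
   local tKeyPart, tValuePart
   put "<tag k=" & quote & "seamark:" into tKeyPart
   put quote & "rock" & quote into tValuePart
   return (pRecord contains tKeyPart) and (pRecord contains tValuePart)
end matchesAnywhereInRecord

-- Bernd's test: the complete seamark:type tag must be present as one string.
function matchesExactTag pRecord
   local tExactTag
   put "<tag k=" & quote & "seamark:type" & quote & space & "v=" & quote & "rock" & quote & "/>" into tExactTag
   return pRecord contains tExactTag
end matchesExactTag
```

For the Sgeirean Ghlasa record shown above, the first function returns true (a seamark tag exists, and "rock" appears in the natural tag), while the second returns false because the seamark:type value is beacon_special_purpose.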

And as to why "binary read" works: it is because the file has ASCII 10 (linefeed) as line delimiter, the same as LC uses internally. There is no ASCII 13 (return) in the file, so there is no need for LC to attempt to transform the line endings to ASCII 10, which it does for "text read".
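
Before switching another file to binary read, it may be worth checking that it really contains no ASCII 13. The following is a minimal sketch under that assumption; `sampleHasOnlyLinefeeds` is a hypothetical helper, and it inspects only the first pSampleSize bytes, so it is a spot check rather than a guarantee for the whole file.

```livecode
-- Hypothetical helper (not from the thread): returns true if the first
-- pSampleSize bytes of pFile contain no carriage return (ASCII 13).
function sampleHasOnlyLinefeeds pFile, pSampleSize
   local tSample
   open file pFile for binary read
   if the result is not empty then return false   -- file could not be opened
   read from file pFile for pSampleSize
   put it into tSample
   close file pFile
   -- byteOffset returns 0 when the byte does not occur in the sample
   return byteOffset(numToByte(13), tSample) is 0
end sampleHasOnlyLinefeeds
```

A call such as `sampleHasOnlyLinefeeds(tFile, 1000000)` looks at the first megabyte only.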

Kind regards
Bernd

Simon Knight · Post by **Simon Knight** » Fri Oct 09, 2020 10:10 am

AxWald

Who in a mental state allowing unattended participation in social life would create such a beast as this file? 3.3 GB, containing at least 1/3rd white noise ... Anyways. Maybe the index can be of help for someone.

Welcome to OpenStreetMap data; the file is tiny compared with, say, all of Europe or the world.

Bernd :

## Open the inputfile for reading
Open file pFile for binary read

Outstanding! Talk about a simple software fix. I'm almost moved to learn how to update the dictionary entry:

You can optionally specify either text or binary mode. If you specify text mode, when you use the write to file command to put data in the file, any line feed and return characters are translated to the appropriate end-of-line marker for the current operating system before being written to the file. The end-of-line marker on Mac OS and OS X systems is a return character; on Unix, a line feed; on Windows, a CRLF. When you use the read from file command to get data from the file, end-of-line markers are translated to the return constant, and any null characters are translated to spaces (ASCII 32).

How about the insertion of

Note this translation requires considerable CPU time and slows the processing of large files. For maximum speed use the binary form.

preferably in a red type face.

bn · Post by bn » Fri Oct 09, 2020 12:11 pm

Simon Knight wrote: ↑
Fri Oct 09, 2020 10:10 am
How about the insertion of
Note this translation requires considerable CPU time and slows the processing of large files. For maximum speed use the binary form.
preferably in a red type face.

You would have to make clear that this is only safe if the file is suitable for binary access which is then used for text analysis. I do not know what happens with unicode stuff. High ASCII (>127) is also translated by the "text" form to the platform-specific values. That could be a problem for text analysis with high ASCII characters.
In this specific case this is no problem.

With some very clear caveats you could add it as a foot note as a rare and specific option. Otherwise the risk of confusion and unexpected results is very high.
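
One way to address the unicode concern is to decode the bytes explicitly after a binary read. The sketch below assumes the file is UTF-8 encoded, which should be confirmed from the XML declaration at the top of the file, and it relies on `textDecode`, which is available in LiveCode 7 and later. `readDecodedChunk` is a hypothetical helper and was not tested against the Scotland file.

```livecode
-- Hypothetical helper (not from the thread): reads one block from a file
-- that is already open for binary read, extends it to the next "</node>",
-- and decodes the bytes as UTF-8 (assumed encoding).
function readDecodedChunk pFile, pChunkSize
   local tBytes, tStatus
   read from file pFile for pChunkSize
   put the result into tStatus   -- empty while more data remains
   put it into tBytes
   if tStatus is empty then
      -- ">" is a single ASCII byte, so stopping after "</node>" cannot
      -- cut a multi-byte UTF-8 character in half
      read from file pFile until "</node>"
      put it after tBytes
   end if
   return textDecode(tBytes, "UTF-8")
end readDecodedChunk
```

Decoding costs extra time, so for a search on plain ASCII tags such as seamark:type it is probably only worth doing on the records that are actually written out.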

Kind regards
Bernd

bn · Post by bn » Fri Oct 09, 2020 12:47 pm

I came across binary reading here

http://forums.livecode.com/viewtopic.php?f=9&t=3690

Garrett wanted to index a 23 Gigabytes Wikipedia xml file. It took him 3 days to do it, with some optimization it went down to 30 minutes.

While looking into it I found the speed advantage of binary read

http://forums.livecode.com/viewtopic.php?f=9&t=3728

But at that time the speed advantage of using binary over text for read was not as dramatic as it is now. LC must be doing more checking with the unicode stuff. (a guess)

A little history...

Kind regards
Bernd

Klaus · Post by **Klaus** » Fri Oct 09, 2020 1:32 pm

I'm extremely curious how many rocks there actually are in Scotland in the end!
