Concurrent loop/string processing to speed up Livecode

sriethmuellerBUSfZ9Y · Post by **sriethmuellerBUSfZ9Y** » Sun Dec 01, 2019 12:00 am

Hi,

I am trying to find a way to process a large amount of text (strings) by breaking up the text into subtexts and processing each subtext concurrently. I am not sure whether this is possible in Livecode.

More specifically:

I have a list of x number of words and, in this example, want to parse the text (e.g. a large document, think book) for each of these words, and then record the line and position in the line of each of these words in the text/document. The actual processing is much more involved but for purposes of this question, let's keep the actual processing in each line of text relatively simple.

Here is the sample script:

on mouseup
   put the milliseconds into sectest
   put empty into allcheckedwords

   #words for parsing go into array
   put 0 into x
   repeat for each line tline in field 4
      add 1 to x
      put word 1 of tline into termarray[x]
   end repeat

   #parse document with each of the words in the array
   put field 1 into fl1
   repeat for each element tele1 in termarray
      put tele1 into checkword
      put 0 into y
      repeat for each line tline2 in fl1
         add 1 to y
         put 0 into z
         repeat for each word tword in tline2
            add 1 to z
            if tword is checkword then put y & " " & z & " " & checkword & return after allcheckedwords
         end repeat
      end repeat
   end repeat

   put allcheckedwords into field 2

   put the milliseconds - sectest into secshow
   put secshow & return after field 5
end mouseup

If I use a document with 83,500 words and 11,000 lines, and have 11 different words (like, "is" "in" "upon" etc) to search, the entire process takes about 420 milliseconds.

I would like to get the processing time down to 200 milliseconds or less in this example. In practice, the program parses a much larger number of words (around 200 different words).

I understand about locking the screen and I am using repeat for each, etc. to speed the process up.

In LiveCode 6.5 and similar early versions, this type of processing was exceptionally fast. With LiveCode 9.5, it is a lot slower. So to speed up the processing, I am wondering whether there is a way to break up the document into two (or more) sub-docs and parse each sub-doc concurrently.

Thus, I could find the midpoint in the lines of the document and then process the first half and the second half concurrently (i.e., lines 1 to MidPoint become Sub-Doc1 and lines MidPoint+1 to the last line become Sub-Doc2). LiveCode would process each Sub-Doc more or less simultaneously to cut the processing time in half. The results could then be merged at the end, with the midpoint simply added to the line numbers for Sub-Doc2.
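
To make the bookkeeping of that idea concrete, here is a minimal sketch of the split-and-merge step only. The handler names scanPart and scanInTwoHalves are hypothetical, and the two halves are still scanned one after the other within a single script, so this shows how the line numbers of Sub-Doc2 would be offset on merging, not an actual speed-up.

```livecode
-- Hypothetical helper: scans pText for pTerm and returns one
-- "line position term" row per hit, adding pLineOffset to each line number.
function scanPart pText, pTerm, pLineOffset
   local tLineNum, tWordNum, tHits
   put 0 into tLineNum
   repeat for each line tLine in pText
      add 1 to tLineNum
      put 0 into tWordNum
      repeat for each word tWord in tLine
         add 1 to tWordNum
         if tWord is pTerm then
            put (tLineNum + pLineOffset) && tWordNum && pTerm & return after tHits
         end if
      end repeat
   end repeat
   return tHits
end scanPart

-- Hypothetical wrapper: splits at the midpoint and merges both result lists.
-- The second half gets the midpoint added to its line numbers.
function scanInTwoHalves pText, pTerm
   local tMid
   put (the number of lines of pText) div 2 into tMid
   return scanPart(line 1 to tMid of pText, pTerm, 0) & \
         scanPart(line (tMid + 1) to -1 of pText, pTerm, tMid)
end scanInTwoHalves
```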

Is something like this possible in LiveCode? The actual code involves a lot more processing in each line of the document than illustrated in the above example, so fundamentally different approaches would be very difficult to implement. I am already using filters like "is among" etc.

Thus, parallel or concurrent loop-based processing would be ideal.

Any thoughts would be greatly appreciated!

dunbarx · Post by **dunbarx** » Sun Dec 01, 2019 12:22 am

Hi.

Glancing at your handlers (do please put them into the "</>" space) you seem to be efficient in that you are only processing variables, and not ever, say, fields.

That said, LC can only run one "process" at a time. You can perform what I call a "binary search", where you split a dataset into halves based on the result of some filter, then divide again and again, so that the total burden is split into ever smaller pieces. But that is not the same as splitting a task into two separate concurrent handlers that would work twice as fast.

Craig

FourthWorld · Post by **FourthWorld** » Sun Dec 01, 2019 12:35 am

sriethmuellerBUSfZ9Y, if you can post the data you're working with, or a reasonable facsimile of it, we may be able to help find an optimal algo for processing it.

sriethmuellerBUSfZ9Y · Post by **sriethmuellerBUSfZ9Y** » Sun Dec 01, 2019 12:49 am

Thx. You could use any text but here is a link to the type of documents (converted into plain text)

https://nvca.org/wp-content/uploads/201 ... ement.docx

It's the model stock purchase agreement from the National Venture Capital Association. It has about 1,500 lines and 17,500 words. I converted the entire text to lowercase with toLower.

It took 265 milliseconds to process the following 10 words:

the
is
company
both
party
parties
in
after
upon

FourthWorld · Post by **FourthWorld** » Sun Dec 01, 2019 1:16 am

Thanks for the data.

How is the position information used?

sriethmuellerBUSfZ9Y · Post by **sriethmuellerBUSfZ9Y** » Sun Dec 01, 2019 1:34 am

The actual application uses the location data (line and word position in line) to then highlight certain instances of the specified terms in the document (e.g. a contract). For example, after the location position has been collected, the application highlights certain instances of a term as selected by the user in the document by using different foreground or background colors at these locations of the selected term in the document/text.

[-hh] · Post by **[-hh]** » Sun Dec 01, 2019 10:27 pm

You could try to do the check for the terms as innermost, not as outer loop:

Code: Select all

on mouseup
  lock screen; lock messages
  put the milliseconds into ms
  put fld "words" into terms
  put field "IN" into strng
  repeat for each line L in strng
    add 1 to nl; put 0 into nw
    repeat for each word W in L
      add 1 to nw
      if W is among the lines of terms then
        add 1 to h
        put cr& nl && nw && W after allHits
      end if
    end repeat
  end repeat
  put char 2 to -1 of allHits into field "OUT" #<-- needs 20% of time
  put the num of lines of strng & " lines / " & \
        the num of words of strng & \
        " words / " & h & " hits / " & \
        the milliseconds-ms & " ms" into fld "timing"
end mouseup

sriethmuellerBUSfZ9Y · Post by **sriethmuellerBUSfZ9Y** » Mon Dec 02, 2019 1:48 am

Thank you very much. I think the issue is that the code does not allocate the specific term to the location. I will need to map each instance of a specific term to a specific location.

Since concurrent parsing does not seem to work, I have instead started to narrow the starting line and end line of the document/text using lineOffset. That way, for each term, the parsing starts with the first line that contains the term, and I am working on identifying the last line containing the term. The parsing of each term is then limited to the part of the document that contains one or more occurrences of the term.

This looks promising. Will report on the improvements in speed when done.
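
A rough sketch of the lineOffset narrowing described above, under some assumptions: the function name firstAndLastLineWith is hypothetical, and with the default wholeMatches setting lineOffset also matches a term inside longer words (e.g. "in" inside "within"), so the range it returns may be wider than strictly necessary but never too narrow.

```livecode
-- Hypothetical helper: returns "firstLine,lastLine" of the lines in pText
-- that contain pTerm, or empty if the term does not occur at all.
function firstAndLastLineWith pTerm, pText
   local tFirst, tLast, tSkip, tFound
   put lineOffset(pTerm, pText) into tFirst
   if tFirst = 0 then return empty
   put tFirst into tLast
   put tFirst into tSkip
   repeat
      -- the third parameter is the number of lines to skip;
      -- the result is counted from the first line after the skipped ones
      put lineOffset(pTerm, pText, tSkip) into tFound
      if tFound = 0 then exit repeat
      add tFound to tSkip
      put tSkip into tLast
   end repeat
   return tFirst & comma & tLast
end firstAndLastLineWith
```

The per-term scan would then run only over line tFirst to tLast of the text, adding tFirst - 1 to each line number it records. Note that finding the last line this way walks through every matching line, so whether it pays off depends on how clustered the term is in the document.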

FourthWorld · Post by **FourthWorld** » Mon Dec 02, 2019 3:19 am

LC doesn't support multithreading, but can be used with multiprocessing, using a pool of standalones as workers to map-and-reduce tasks.

But the overhead of concurrency should not be needed here.

If you record only the word positions (using trueWord to account for punctuation), selecting them later is as simple as:

Code: Select all

select trueWord (word 1 of the hilitedText of fld "FoundList") of fld "Main"

...and it simplifies the indexing down to a single loop:

Code: Select all

on mouseUp
   put the millisecs into tStartTime
   put fld "terms" into tTermList
   put fld "main" into tText
   set the wholematches to true
   put 0 into i
   repeat for each trueWord tWord in tText
      add 1 to i
      put lineoffset(tWord, tTermList) into tOffset
      if tOffset = 0 then next repeat
      put i && line tOffset of tTermList &cr after tOutList
   end repeat
   put tOutList into fld "FoundList"
   put the millisecs - tStartTime &" ms"
end mouseUp

On my system your original code takes ~280 ms, and this lighter algo completes in ~40 ms.

If it would be helpful to work directly on DocX and ODT files, we can explore parsing out the relevant portions with the revZip external, and traversing them with the revXML external.

sriethmuellerBUSfZ9Y · Post by **sriethmuellerBUSfZ9Y** » Mon Dec 02, 2019 5:40 am

Thank you! That's a great starting point. There is quite a bit more processing than simply highlighting a term but the approach of parsing the text/document and then using lineOffset to check if word is in the list of lines is a good starting point. I will see if it is ultimately faster than the current approach when factoring all the processing I am doing with the terms but it's a great start! Much appreciated.

FourthWorld · Post by **FourthWorld** » Mon Dec 02, 2019 7:58 am

If you need to know the line for later interactions with the list created from this algo, the charIndex function may be useful. With that you can obtain the number of chars before a given trueWord (or any other chunk), and use that to get the number of lines, or just about any other info you'd need, during user interaction when you have time to burn (humans are generally slower than machines), keeping the upfront work of the indexing as sparse as practical.

rkriesel · Post by **rkriesel** » Thu Dec 05, 2019 3:48 am

Another way to speed up the processing is to eliminate the chunk counting, by invoking the split command.

The code from [-hh] derives line numbers and word numbers within the lines of the text.
The code from FourthWorld derives word numbers within the whole text.
With split they're about 20% faster.

Perhaps the OP would find useful some code that produces an array of all word numbers for each term:
tWordNumbersForTerm[ <term> ][ <word number> ] = "true"
That would be a step toward custom property set uDocumentIDsAndWordNumbersForTerm with a custom property for each document.

Code: Select all

on mouseUp
   put short name of me & cr into msg
   put fld "main" into tText -- all lower case
   put fld "terms" into tTermList -- all lower case
   
   repeat for each word tHandler in "lineAndWordNumbersViaRepeat lineAndWordNumbersViaSplit wordNumbersViaRepeat wordNumbersViaSplit wordNumbersForTerm"
      put the milliseconds into tMilliseconds
      
      dispatch function tHandler with tText, tTermList
      get the result -- fast due to copy-on-write
      
      put tHandler && the milliseconds - tMilliseconds & " ms" & cr after msg
      put it into tResultForHandler[ tHandler ]
   end repeat
   
   breakpoint
end mouseUp

# by [-hh] » Sun Dec 01, 2019 2:27 pm; lightly edited for comparing
function lineAndWordNumbersViaRepeat strng, terms
   repeat for each line L in strng
      add 1 to nl
      put 0 into nw
      repeat for each trueWord W in L
         add 1 to nw
         if W is among the lines of terms then
            put cr & W && nl && nw after allHits
         end if
      end repeat
   end repeat
   return char 2 to -1 of allHits
end lineAndWordNumbersViaRepeat

function lineAndWordNumbersViaSplit pText, tTermList
   split tTermList by cr as set
   repeat for each line tLine in pText
      add 1 to tLineNumber
      put 0 into tWordNumber
      repeat for each trueWord tWord in tLine
         add 1 to tWordNumber
         if tTermList[ tWord ] then
            put "true" into tCoordinatesForTerm[ tWord ][ tLineNumber ][ tWordNumber ]
         end if
      end repeat
   end repeat
   return tCoordinatesForTerm
end lineAndWordNumbersViaSplit

# by FourthWorld » Sun Dec 01, 2019 7:19 pm; lightly edited for comparing
function wordNumbersViaRepeat pText, pTermList
   set the wholeMatches to true
   repeat for each trueWord tWord in pText
      add 1 to i
      put lineOffset( tWord, pTermList ) into tOffset
      if tOffset = 0 then next repeat
      put i && line tOffset of pTermList & cr after tOutList
   end repeat
   return tOutList
end wordNumbersViaRepeat

function wordNumbersViaSplit pText, tTermList
   split tTermList by cr as set
   repeat for each trueWord tWord in pText
      add 1 to i
      if tTermList[ tWord ] then
         put i && tWord & cr after tOutList
      end if
   end repeat
   return tOutList
end wordNumbersViaSplit

# create array tWordNumbersForTerm[ <term> ][ <word number> ] = "true"
function wordNumbersForTerm pText, tTermList
   split tTermList by cr as set
   repeat for each trueWord tWord in pText
      add 1 to i
      if tTermList[ tWord ] then
         put "true" into tWordNumbersForTerm[ tWord ][ i ]
         -- add 1 to tWordCountForTerm[ tWord ] -- not useful for comparing but exemplary
      end if
   end repeat
   return tWordNumbersForTerm
end wordNumbersForTerm

-- Dick
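
For illustration, the nested array returned by wordNumbersForTerm above could be read back as a sorted list of word numbers for one term. This is only a sketch; the function name wordNumberList is hypothetical and not part of the code in this post.

```livecode
-- Hypothetical helper: pWordNumbersForTerm is the array returned by
-- wordNumbersForTerm; returns the word numbers for pTerm, one per line,
-- in ascending order (empty if the term was never found).
function wordNumberList pWordNumbersForTerm, pTerm
   local tNumbers
   put the keys of pWordNumbersForTerm[pTerm] into tNumbers
   -- array keys come back in no particular order, so sort them numerically
   sort lines of tNumbers ascending numeric
   return tNumbers
end wordNumberList
```

Each returned number can then be used as a trueWord index into the original text, for example when highlighting hits for a term the user has selected.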

FourthWorld · Post by **FourthWorld** » Thu Dec 05, 2019 4:39 am

Good work, Dick.

I had forgotten about the "as set" option with the "split" command, but alas it's not mentioned in the dictionary. Where can I find a description of what that does?

rkriesel · Post by **rkriesel** » Thu Dec 05, 2019 6:17 am

FourthWorld wrote: ↑
Thu Dec 05, 2019 4:39 am
... I had forgotten about the "as set" option with the "split" command, but alas it's not mentioned in the dictionary. Where can I find a description of what that does?

Dictionary:split, section:Description, paragraph:6+

<aside>How could the dictionary evolve to help more?</aside>

FourthWorld · Post by **FourthWorld** » Thu Dec 05, 2019 5:09 pm

rkriesel wrote: ↑
Thu Dec 05, 2019 6:17 am

FourthWorld wrote: ↑
Thu Dec 05, 2019 4:39 am
... I had forgotten about the "as set" option with the "split" command, but alas it's not mentioned in the dictionary. Where can I find a description of what that does?
Dictionary:split, section:Description, paragraph:6+

Indeed it does. I'm a Linux user, and LC is only partially supported on that platform; among other missing features, we have no working browser widget, and thus no Dictionary. So as a workaround for the years the browser widget isn't working, the LC IDE now points to the web version of the Dictionary.

Maybe I was coffee deficient yesterday, or maybe the online Dict was updated, but I could have sworn I'd searched that page for "as set" and came up empty. But this morning that search does indeed yield a description:

If you use the as set form the split command converts the passed variable to an array with the keys being equal to the original list and the values being true. For example, the following statements create an array:

put "A apple,B bottle,C cradle" into myVariable
split myVariable by comma and space

KEY    VALUE
A      apple
B      bottle
C      cradle

It seems too highly specific an option to add, so I'm curious about the use case that prompted it.
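
One use of the "as set" form is the membership test used earlier in this thread: turning a term list into an array so that each word lookup is a single key access. A minimal sketch, with a hypothetical handler name and made-up sample terms:

```livecode
-- Hypothetical demo handler: builds a set from a return-delimited list
-- and tests two words against it.
on demoSplitAsSet
   local tTerms
   put "the" & cr & "is" & cr & "company" into tTerms
   split tTerms by cr as set
   -- tTerms is now an array whose keys are the original lines,
   -- each with the value true
   if tTerms["company"] then put "company is in the set" & cr into msg
   -- a key that was never in the list yields empty
   if tTerms["zebra"] is empty then put "zebra is not in the set" after msg
end demoSplitAsSet
```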

<aside>How could the dictionary evolve to help more?</aside>

First, the online version needs to be kept current with the last stable version.

Second, the Linux version of LC needs a working browser control so it becomes possible once again to consult the Dict offline.

Third, the formatting is FUBAR compared to the old version made entirely in LC, a proven useful solution for delivering information systems. Time and again we discover anomalies in formatting and content which are by-products of HTML's tag handling, and the impact on users ranges from difficulty in finding things (due to paragraph breaks lost in translation) to erroneous information (due to some characters like "<" not being escaped to meet HTML's requirements).

Additionally, like much of the product experience, it would be very beneficial to the perceived value of LC to have a style guide for all UI elements, and ensure that all components of the IDE adhere to it. This would include font sizing, use of white space, and other general layout considerations that make the IDE feel more like a unified, flowing experience than a hodge podge of found objects from parts unknown. The Dict, for example, looks like it was a generic template made by someone else for some other purpose, bearing no similarity to anything else in the product experience, and without any effort to update the CSS to even try to come close.

And lastly, having someone with skills and experience in UX contributing to the Dict would help. It's a fine work in terms of engineering, but a poor user experience. Having defaults that include everything under the sun turned on results in MANY searches yielding seemingly duplicate results, mish-mashing glossary and other elements with the actual Dict definitions needed for immediate scripting. This creates uncertainty for the user, who is unsure which of several similar or even identical entries to pick, creating at least lost time and at worst an adverse reaction to the product design. Some prudent judgment about use cases would clean that up, making it more efficient and enjoyable to use.
