For each line loop slows down
Posted: Thu Aug 25, 2016 5:00 pm
by edmerckx99
I have a script that is parsing a large text file, ~30,000 lines with around 40-50 items per line. It's taking quite a long time to process, sometimes 45 minutes. If I just loop through each line without running my parsing code, it takes about 3 minutes.
One of the interesting things is that the script seems to slow down over time. I am trying to convert each line to a SQL INSERT command so I can bulk import them into a SQLite database.
Here is my code:
Code: Select all
function prepareSQL pText, pColumns
   -- drop the header line
   delete the first line of pText
   local sqlText, sqlLineTemplate, theLines, tempLine, x, y
   put 1 into theLines
   -- theSchemaArray is assumed to be filled in elsewhere (script local or global)
   put "INSERT INTO 'TheTable'" && "(" & pColumns & ")" && "VALUES" into sqlLineTemplate
   repeat for each line tLine in pText
      put tLine into tempLine
      -- wrap the first item in quotes if it is not empty
      if item 1 of tempLine is not empty then
         put quote before item 1 of tempLine
         put quote after item 1 of tempLine
      end if
      put 1 into x
      repeat for each item tItem in tempLine
         -- special handling for line 1947 of the data
         if theLines is 1947 then
            if theSchemaArray[x] is "INTEGER" and item x of tempLine is " " then
               put "NULL" into item x of tempLine
            end if
         end if
         if tItem is empty then
            put "NULL" into item x of tempLine
         else
            if not (tItem contains quote) and not (item x of tempLine is a number) then
               if theSchemaArray[x] is "INTEGER" then
                  -- strip any non-numeric characters from the item
                  repeat with y = the number of chars in tItem down to 1
                     if char y of tItem is not a number then
                        delete char y of tItem
                     end if
                  end repeat
                  if tItem is empty then put "NULL" into item x of tempLine
               end if
            end if
         end if
         add 1 to x
      end repeat
      put sqlLineTemplate && "(" & tempLine & ");" into line theLines of sqlText
      add 1 to theLines
      wait 0 millisec with messages
      updateProgressBar theLines
   end repeat
   return sqlText
end prepareSQL
*edit: sorry, I forgot the last 2 lines
Re: For each line loop slows down
Posted: Thu Aug 25, 2016 5:12 pm
by dunbarx
Hi.
Your code offering is incomplete, and therefore impossible to evaluate. I would make a 30,000 line sample text file to do so, but have no way forward.
Craig Newman
Re: For each line loop slows down
Posted: Thu Aug 25, 2016 5:46 pm
by Thierry
Hi,
Having had a very quick look, here is the first thing I would do.
Replace this:
Code: Select all
wait 0 millisec with messages
updateProgressBar theLines
with this:
Code: Select all
if (theLines mod 300) is 0 then
   wait 0 millisec with messages
   -- will show every new percent
   updateProgressBar theLines
end if
What are your timings then?
Thierry
Re: For each line loop slows down
Posted: Thu Aug 25, 2016 7:27 pm
by FourthWorld
For general tips on speeding up LiveCode scripts, this thread is well worth the read:
http://forums.livecode.com/viewtopic.php?f=8&t=24945
In your case, though, it seems you have SQL inserts taking place without transactions, yes? If so, I'll wager that just opening a transaction before the inserts and closing it afterward will speed up that routine many-fold.
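For illustration, here is a minimal sketch of that pattern, assuming a connection ID from revOpenDatabase and one INSERT statement per line; the variable names are placeholders, not taken from your script:
Code: Select all
-- hypothetical sketch: wrap all the inserts in one transaction
-- tDatabaseID is assumed to come from revOpenDatabase("sqlite", ...)
get revExecuteSQL(tDatabaseID, "BEGIN TRANSACTION")
repeat for each line tStatement in tInsertStatements
   get revExecuteSQL(tDatabaseID, tStatement)
end repeat
get revExecuteSQL(tDatabaseID, "COMMIT")
Without the transaction, SQLite commits every INSERT individually, which is where most of the time typically goes.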
Re: For each line loop slows down
Posted: Fri Aug 26, 2016 1:38 am
by edmerckx99
Thierry,
It seems to have sped it up a bit: around half an hour instead of about 40 minutes. But it still seems to be slowing down as the script goes along.
FourthWorld wrote:For general tips on speeding up LiveCode scripts, this thread is well worth the read:
http://forums.livecode.com/viewtopic.php?f=8&t=24945
In your case, though, it seems you have SQL inserts taking place without transactions, yes? If so, I'll wager that just opening a transaction before the inserts and closing it afterward will speed up that routine many-fold.
I am not executing any SQL here; I am prepping the text for a bulk transaction. None of my tests include the SQL transactions.
Here is the interesting part: I added to the update section some code that calculates the milliseconds between each block of 300 lines.
Code: Select all
if (theLines mod 300) is 0 then
   put the millisecs into t2
   wait 0 millisec with messages
   updateProgressBar theLines
   put t2 - t1 into line timingLine of fld "TimingLine"
   add 1 to timingLine
   put the millisecs into t1
end if
The first 300 took 111 ms.
*** I have all the times between start and finish, I just didn't think to add all 100 of them here. ***
The last 300 took 23553 ms.
Re: For each line loop slows down
Posted: Fri Aug 26, 2016 4:39 am
by FourthWorld
If this:
Code: Select all
put sqlLineTemplate && "(" & tempLine & ");" into line theLines of sqlText
...is changed to this:
Code: Select all
put sqlLineTemplate && "(" & tempLine & ");" &cr after sqlText
...then you get to take fuller advantage of using "repeat for each" rather than "repeat with", since using the chunk expression "into line" requires the engine to count all lines within sqlText each time through the loop.
You may be able to extend that to some of the item expressions as well, appending the items being traversed with "repeat for each" into a new line built up with "put...after...", avoiding expressions that require counting delimiters like "...item x of...".
Appends in LC are generally pretty fast.
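As a rough illustration only (the variable names and the simplified NULL handling below are made up, not your exact parsing rules), the inner loop could build each output line with appends:
Code: Select all
-- sketch: build the output line with appends instead of writing
-- into "item x of tempLine"
local tNewLine
put empty into tNewLine
repeat for each item tItem in tLine
   if tItem is empty then
      put "NULL" & comma after tNewLine
   else
      put tItem & comma after tNewLine
   end if
end repeat
delete the last char of tNewLine -- drop the trailing comma
put sqlLineTemplate && "(" & tNewLine & ");" & cr after sqlText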
Re: For each line loop slows down
Posted: Fri Aug 26, 2016 5:19 am
by edmerckx99
Wow, that was it. It is now taking just 2 minutes. I really didn't think that using
Code: Select all
put "text" into line x of <container>
would have caused that much of a slowdown.
Thank you!
Re: For each line loop slows down
Posted: Fri Aug 26, 2016 11:57 am
by Thierry
edmerckx99 wrote:Thierry,
It seems to have sped it up a bit: around half an hour instead of about 40 minutes. But it still seems to be slowing down as the script goes along.
Hi,
So now that you have reached the 2 minutes, if you send updateProgressBar for every line, you should see that it's not a detail anymore.
edmerckx99 wrote:Here is the interesting part: I added to the update section some code that calculates the milliseconds between each block of 300 lines.
Code: Select all
if (theLines mod 300) is 0 then
   put the millisecs into t2
   wait 0 millisec with messages
   updateProgressBar theLines
   put t2 - t1 into line timingLine of fld "TimingLine"
   add 1 to timingLine
   put the millisecs into t1
end if
The first 300 took 111 ms.
*** I have all the times between start and finish, I just didn't think to add all 100 of them here. ***
The last 300 took 23553 ms.
Yes, that's the way it has worked in LC for a long time.
If the 2 minutes are still too long for your purpose, then I suggest working with an external file for the output.
That way, you're not buffering the whole of your "sqlText" output inside LC.
I've done that for big data quite a few times with good results.
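For example, something along these lines; this is only a rough sketch, and the handler name, the output path parameter and the simplified statement building are made up rather than taken from your code:
Code: Select all
-- rough sketch: write each statement as it is generated instead of
-- accumulating the whole of sqlText in a variable
command prepareSQLToFile pText, pColumns, pOutPath
   local tStatement
   delete the first line of pText -- drop the header line, as in prepareSQL
   open file pOutPath for write
   repeat for each line tLine in pText
      -- (the real per-item cleanup from prepareSQL would go here)
      put "INSERT INTO 'TheTable' (" & pColumns & ") VALUES (" & tLine & ");" into tStatement
      write tStatement & return to file pOutPath
   end repeat
   close file pOutPath
end prepareSQLToFile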
My 2 cents
Thierry
Re: For each line loop slows down
Posted: Fri Aug 26, 2016 3:39 pm
by FourthWorld
edmerckx99 wrote:Wow, that was it. It is now taking just 2 minutes. I really didn't think that using
Code: Select all
put "text" into line x of <container>
would have caused that much of a slowdown.
Thank you!
Happy to help. LiveCode is a very capable language, made even better with flexible syntax that generally prevents us from having to think like a computer. But once in a while, especially when performance is critical, thinking like a computer can help.
Chunk expressions like "item tSomething of line tSomethingElse" are very powerful, offering graceful parsing that rivals sed or awk for many tasks while being much more readable.
And in most cases they're pretty efficient, with the traversal of the string and keeping track of delimiters all done in reasonably well optimized C++.
But in loops, especially loops working on large data, it pays to be mindful of any chunk expressions which may cause redundant string traversal.
The "repeat for each" form keeps a pointer into the source string as it goes, so it never needs to look for more than one delimiter. The "repeat with" form doesn't normally do that, and when its iterator variable is used in a chunk expression (like "get line i of tSomething"), the engine, in order to know what we mean, has to traverse the string, counting line delimiters as it goes, until it finds the i-th one.
I've been delightfully surprised with how efficiently LiveCode does appends. I have no idea how they manage such dynamic memory allocation under the hood, but it's pretty nice. In many cases where I need to transform a large chunk of text, I find that parsing it with "repeat for each" and then building up a transformed copy with append operations is much faster than transforming the source text in place, since it never needs to count anything as it goes.
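A minimal side-by-side sketch of the difference, with made-up variable names:
Code: Select all
-- slower: "line i of tData" and "into line i of tResult" both make the
-- engine count line delimiters from the start on every pass
repeat with i = 1 to the number of lines of tData
   put line i of tData into line i of tResult
end repeat

-- faster: "repeat for each" keeps its place in tData, and a plain
-- append to tResult never needs to count anything
repeat for each line tLine in tData
   put tLine & return after tResult
end repeat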