Performance badly degrading with growing string variable
With LC 8.0 Indy (and older versions)
I often have to change values in large (>12 MB) csv files.
Normally I put the file into an input variable,
repeat with each Line some conversion,
put the changed line after an output variable,
finally write this output variable back to the file.
This works just fine, but processing time skyrockets with the number of added output lines:
A simple test routine verifies this:
On my machine adding the first 10000 lines to a string takes about 13 ms, lines 200000 to 210000 take about half a second, and lines 490000 to 500000 take nearly 1.5 sec.
With some other processing this adds up (in the actual stack, to a running time of nearly half an hour).
I hope someone has an easy way around this problem.
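This is the classic quadratic-append pattern: each append to a very long string can copy everything built so far, so total cost grows with the square of the output length. The effect, and the usual fix of collecting pieces and joining once, can be sketched outside LiveCode in Python (names are illustrative):

```python
# Naive: repeatedly concatenating onto one growing string.
# Each step may copy everything built so far -> O(n^2) total work.
def build_naive(lines):
    out = ""
    for line in lines:
        out = out + line + "\n"
    return out

# Fix: collect the pieces and join once at the end -> O(n) total work.
def build_joined(lines):
    return "".join(line + "\n" for line in lines)
```

Both produce identical output; only the cost differs as the output grows.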
Re: Performance badly degrading with growing string variable
Is this timing different than with v7?
And can you post your code? We've had very good luck working together here to optimize things.
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn
Re: Performance badly degrading with growing string variable
Performance seems to be slightly better in 8.
A test stack would have a scrolling field named "f_show"
and a button to run the test with the following script
Code: Select all
on mouseUp
   put 10000 into tChunkSize
   put 300000 into tMaxIterations
   put 0 into tCounter
   put 0 into tTeller
   put milliseconds() into tNow
   put "Milliseconds used for " & tChunkSize & " iterations" & cr into field "f_Show"
   repeat with i = 1 to tMaxIterations
      put "The quick brown fox jumps over the lazy dog" & cr after tBigData
      add 1 to tCounter
      if tCounter > tChunkSize then
         add tChunkSize to tTeller
         put (milliseconds() - tNow) & ": " & tTeller & cr after field "f_Show"
         put milliseconds() into tNow
         put 0 into tCounter
      end if
   end repeat
   put (milliseconds() - tNow) & ": " & i after field "f_Show"
end mouseUp
Re: Performance badly degrading with growing string variable
In repeat operations you'll normally see a rather huge speed boost by first taking the data out of a field object, working on it in a variable within the loop, and then putting it into the field for display when you're done.
In your case, it seems you're logging time increments as you go. That could still be done in a variable, though - do you need to see that output as it goes?
Also, clearly that's sample data, since the algo would seem to have no value other than to measure itself. To help find the best solution for your needs it would be helpful to better understand the problem you're solving. What do you need to do?
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn
Re: Performance badly degrading with growing string variable
Sorry, I may not have been clear enough:
The job requires a number of csv-files (about 80 MB) to be put into an SqLite database.
This means translating the data to SQL and doing some conversions on the fly, like exchanging commas for dots, since the data comes from a German system that uses the comma as decimal point. Other changes have to happen too, so the whole script gets rather complicated.
It takes a long time to run and I could pin this down to the string growing.
As you noticed, my example here does nothing but add one line to the string, over and over.
The time measuring is only for this testing situation and does not significantly affect the timing problem.
But when you run it you can really see processing time going up terribly - weird, and I wonder ...
Re: Performance badly degrading with growing string variable
CSV is a notoriously FUBAR format. Fortunately one of our community members, Alex Tweedly, came up with a nice algo that parses it with reasonably good efficiency into a format that lends itself much better to working with - see the middle of this page:
http://www.fourthworld.com/embassy/arti ... t-die.html
That handler returns the data in a format where fields are separated by tabs and lines by CR, and by default it replaces in-data CRs with ASCII 11 (though there are params to use a different character if needed).
To import your data into SQLite I'd suggest:
1. Get the list of files
2. For each, read it in,
2.1 Pass it through CsvToTab to get a format that's efficient to work with
2.2 Make adjustments to that format for import as needed for SQL
2.3 Do the insert between OPEN TRANSACTION and CLOSE TRANSACTION SQL statements
Tips to keep things efficient:
1. Constantly adding data to a field as shown doesn't do anything for the processing of the files and only slows things down progressively, as you've seen, since the data gets longer and longer.
2. Showing progress can be useful, but any indication of progress takes time away from the data processing so you'll want to keep those to the minimum needed.
3. If each file processes quickly enough, your progress indicator should just show files, e.g.:
Code: Select all
put "Processing file "& i &" of "& tNumberOfFiles into fld "Progress"
4. Where possible, use "repeat for each line..." rather than "repeat with i =..." for handling loops, as the former does a smarter job of keeping track of where it is and parsing as it goes, while the latter needs to start at the beginning of the chunk in each iteration and count i number of lines to get to the one to be worked on.
Hopefully that outline and those tips will speed things up for you by at least an order of magnitude. If not please consider posting the full code here so we can help optimize it for you.
This thread is long but perhaps a good read, a discussion of an algo that originally took 9 minutes ultimately brought down to just a few millisecs, and along the way explaining each of the alterations to the algo to achieve those speeds:
http://forums.livecode.com/viewtopic.php?f=8&t=24945
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn
Re: Performance badly degrading with growing string variable
Thanks for all your considerations, Richard!
The actual processing algo isn't the problem, as it includes TRANSACTION, repeat for each …
It really comes down to "put something after some string" taking terribly long with bigger strings.
In this test stack, writing 500000 dummy lines costs almost 36 seconds.
I finally got around this by doing it in smaller chunks:
append data to a string tBigDataPart, and after a bunch of lines append tBigDataPart to tBigData. This reduces the total time of this test drastically:
with chunks of 1000 Lines to 3.5 seconds
with chunks of 10000 Lines to 0.8 seconds
with chunks of 20000 Lines to 1.2 seconds
The optimum number varies with the length of the lines to add.
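The chunked workaround, sketched outside LiveCode in Python for comparison (chunk size is the tuning knob described above; names are mine):

```python
# Accumulate lines in a small buffer and flush it into the big result
# periodically, so most concatenations stay short.
def build_chunked(lines, chunk_size=10000):
    big_parts = []   # flushed chunks
    buffer = []      # current small chunk
    for line in lines:
        buffer.append(line)
        if len(buffer) >= chunk_size:
            big_parts.append("\n".join(buffer))
            buffer.clear()
    if buffer:
        big_parts.append("\n".join(buffer))
    return "\n".join(big_parts)
```

The result is identical to naive appending; only the amount of copying changes.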
Re: Performance badly degrading with growing string variable
Hmmm.

Havanna wrote:
Normally I put the file into an input variable,
repeat with each Line some conversion,
put the changed line after an output variable,
finally write this output variable back to the file.

This looks very slow. Why do you do it line by line? Why don't you just replace in the input variable and use this then for output? Why not:

Code: Select all
repeat with i = 1 to Length(MyVar)
   put char i of MyVar into MyChar
   if MyChar = "," then put "." into char i of MyVar
   else if MyChar = Quote then put "'" into char i of MyVar
   else if -- whatever else ...
end repeat
return MyVar

For sure, you could also use:

Code: Select all
repeat for each char myChar in MyVar

Should be worlds faster. But maybe I just misunderstood this all ;-)))

Have fun!

Addendum to Richard:

FourthWorld wrote:
CSV is a notoriously FUBAR format.

CSV here is obviously not meant as "Comma-Separated-Values" as you like to call it.

Havanna wrote:
[...] like exchanging commas for dots as data comes from a german system that would use the comma as a decimal point. [...]

Not even we Germans would be nasty enough to use our decimal separator as field separator, too - so we basically use "SSV" instead ("Semicolon-Separated-Values"). ;-)

But since nobody else would know what this may be (.ssv?), we just use "CSV" in the more useful common meaning of "Character-Separated-Values".
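For what it's worth, most general-purpose CSV parsers take the delimiter as a parameter, so the semicolon flavor needs no special handling; e.g. with Python's csv module (sample records invented):

```python
import csv
import io

# Quoted, semicolon-delimited records, German decimal commas.
sample = '"172";"30";"3,8500"\n"171";"28";"7,6500"\n'

# The parser strips the quotes and splits on ";" for us.
rows = list(csv.reader(io.StringIO(sample), delimiter=";"))
for row in rows:
    row[2] = row[2].replace(",", ".")  # decimal comma -> point
```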
All code published by me here was created with Community Editions of LC (thus is GPLv3).
If you use it in closed source projects, or for the Apple AppStore, or with XCode
you'll violate some license terms - read your relevant EULAs & Licenses!
Re: Performance badly degrading with growing string variable
AxWald wrote:
Addendum to Richard:
FourthWorld wrote:
CSV is a notoriously FUBAR format.
CSV here is obviously not meant as "Comma-Separated-Values" as you like to call it.

It wasn't obvious to me. Commas are one of the most commonly used delimiters, if not the most common; the data being worked on was not shown, and neither "semi-colon" nor ";" appears in this thread prior to your post. How were you able to discern that the data being worked on uses semi-colons as its field delimiter?

AxWald wrote:
Not even we Germans would be nasty enough to use our decimal separator as field separator, too - so we basically use "SSV" instead ("Semicolon-Separated-Values").
But since nobody else would know what this may be (.ssv?), we just use "CSV" in the more useful common meaning of "Character-Separated-Values".

Regardless of the specific delimiter used, the issues of escaping in-data returns and in-data delimiters, the subsequent need to escape the escapes whenever printable characters are used for escapes, and the need to handle the common (though, confoundingly, not universal) convention of quoting non-numeric values are shared by all delimited tabular data. While commas are perhaps the poorest choice, any delimited data will require handling these considerations in a generalized parser.
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn
Re: Performance badly degrading with growing string variable
AxWald wrote:
Hmmm. This looks very slow. Why do you do it line by line? Why not:
Code: Select all
repeat with i = 1 to Length(MyVar)
   put char i of MyVar into MyChar
   if MyChar = "," then put "." into char i of MyVar
   else if MyChar = Quote then put "'" into char i of MyVar
   else if -- whatever else ...
end repeat
return MyVar
For sure, you could also use:
Code: Select all
repeat for each char myChar in MyVar
Should be worlds faster.

The "repeat with" structure is almost always slower since it has to count from the beginning of the string every time through the loop. "Repeat for" is much faster, but in this case you'd still need to build a new list, unless you also want to keep an index counter (which would slow it down because it's counting characters again). Benchmarks on similar handlers in the past have usually shown that it is faster to build a new list than to alter an existing one, and I generally use the rebuild method myself. But it would be interesting to see if that holds true in this case.
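Jacque's build-a-new-list advice, sketched in Python for comparison (where strings are immutable, so the rebuild style is the natural one anyway; names are illustrative):

```python
# Walk the input once and append converted characters to a new
# sequence - analogous to "repeat for each char" plus a rebuild.
def rebuild(text):
    out = []
    for ch in text:
        if ch == ",":
            out.append(".")
        elif ch == '"':
            out.append("'")
        else:
            out.append(ch)
    return "".join(out)
```

Each character is visited exactly once, with no per-iteration scan back to the start of the string.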
Jacqueline Landman Gay | jacque at hyperactivesw dot com
HyperActive Software | http://www.hyperactivesw.com
Re: Performance badly degrading with growing string variable
Hi,
FourthWorld wrote:
How were you able to discern that the data being worked on uses semi-colons as its field delimiter?

- Because Havanna said so (I quoted it), and
- from experience. ;-)

To be honest, I've never seen a CSV actually separated with commas, and it's quite some aeons ago that I saw my first. I assume that's because I'm not in the anglo-saxon world, where ppl still use prehistoric measurements and points as decimal dividers ... ;-)

jacque wrote:
The "repeat with" structure is almost always slower since it has to count from the beginning of the string every time through the loop.

Oooops - seems I mixed up LC with HC :(
But it's worth a test. Too bad we cannot check the result with MacsBug anymore ;-)
Have fun!
All code published by me here was created with Community Editions of LC (thus is GPLv3).
If you use it in closed source projects, or for the Apple AppStore, or with XCode
you'll violate some license terms - read your relevant EULAs & Licenses!
Re: Performance badly degrading with growing string variable
AxWald wrote:
FourthWorld wrote:
How were you able to discern that the data being worked on uses semi-colons as its field delimiter?
- Because Havanna said so (I quoted it), and
- from experience.
To be honest, I've never seen a CSV actually separated with commas, and it's quite some aeons ago that I saw my first. I assume that's because I'm not in the anglo-saxon world, where ppl still use prehistoric measurements and points as decimal dividers ...

Given that she never mentions semi-colons, and wrote only of numeric separators and not item delimiters, that's an amazing bit of inference.
@Havanna: it may be helpful if you're in a position to provide a link to the data and an example of the desired output. I can't help but wonder if there's a way to do replacements across the entire data set without using any loops at all.
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn
Re: Performance badly degrading with growing string variable
Hi,
Jacque's remark about speed gave me headaches, so I had to test.
At first I exported a table from a database, using a predefined format in HeidiSQL named "Excel CSV": Semicolon as field delimiter, Return as record delimiter, Quotes around all fields. That's what every spreadsheet can read. The data:
Code: Select all
File size: 4.497.398 Bytes; That's 16.912 Records à 38 Fields
Example (3 records):
"172";"30";"55";"1";"A product name xyz";"";"";"0000055";"6";"3,8500";"";"2014-10-02";"1";"2";"angelegt";"";"";"2014-10-05 19:14:29";"0,00";"0,00";"0,00";"0";"0";"0,00";\N;"0,00";"0,00";"0,00";"0,00";"0,00";"";"0,00";"0";"0";"0";"";"0";"0"
"171";"30";"54";"1";"Another product name abc";"";"";"0000054";"6";"7,6500";"";"2014-10-02";"1";"1";"angelegt";"";"";"2014-10-05 19:14:02";"0,00";"0,00";"0,00";"0";"0";"0,00";\N;"0,00";"0,00";"0,00";"0,00";"0,00";"";"0,00";"0";"0";"0";"";"0";"0"
"166";"28";"254";"1";"Still a product name 123";"";"";"0000254";"144";"1,2500";"";"2014-10-04";"1";"7";"angelegt";"ermaessigt";"";"2014-10-05 18:57:34";"0,00";"0,00";"0,00";"0";"0";"0,00";\N;"0,00";"0,00";"0,00";"0,00";"0,00";"";"0,00";"0";"0";"0";"";"0";"0"
(I added a Return here between the records for better distinguishing. And I don't know why there's 4 Spaces in the name field, but such happens ...)
At first, I try some simple stuff - it's German data, so the decimal divider is Comma. Let's make it a Point. And while we're at it, let's replace Quotes with Single Quotes:
The traditional way using "repeat with i", HC style.
Code: Select all
put fld "csv_fld_org" into MyVar
put 0*1 into MyCounter
put the milliseconds into MyStart
repeat with i = 1 to len(MyVar)
   put char i of MyVar into MyChar
   if MyChar = comma then
      put "." into char i of MyVar
      add 1 to MyCounter
   else if MyChar = quote then
      put "'" into char i of MyVar
      add 1 to MyCounter
   end if
end repeat
put MyCounter & " replacements / " & the milliseconds - MyStart & " ms;"
beep
put MyVar into fld "csv_fld"
6.7.10: 1425178 replacements / 2960 ms;
8.0: 1425178 replacements / 13400 ms;
(all time values are rounded, the last non-zero digit can vary a bit)
Now "repeat for each":
Code: Select all
put fld "csv_fld_org" into MyVar
put 0*1 into MyCounter
put the milliseconds into MyStart
repeat for each char MyChar in MyVar
   if MyChar = comma then
      put "." into MyChar
      add 1 to MyCounter
   else if MyChar = quote then
      put "'" into MyChar
      add 1 to MyCounter
   end if
end repeat
put MyCounter & " replacements / " & the milliseconds - MyStart & " ms;"
beep
put MyVar into fld "csv_fld"
6.7.10: 1425178 replacements / 1970 ms;
8.0: 1425178 replacements / 10100 ms;
Slightly faster. Just - this doesn't change MyVar! It only plays with its own variables ;-) Left as an exercise to the reader to find out why ...
Btw, see the numbers for 8.0! Ouch!
But now I want speed:
Code: Select all
put fld "csv_fld_org" into MyVar
put the milliseconds into MyStart
replace comma with "." in MyVar
replace quote with "'" in MyVar
put the milliseconds - MyStart & " ms;"
beep
put MyVar into fld "csv_fld"
6.7.10: 48 ms;
8.0: 32 ms;
This is the way to go! And here 8.0 looks quite well, too.
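The whole-string replace approach maps directly to other languages; a Python sketch of the same two substitutions, as a single pass per replacement with no explicit loop:

```python
# Whole-string replacement: the runtime scans the string once per
# substitution instead of doing per-character bookkeeping in a loop.
def convert(text):
    return text.replace(",", ".").replace('"', "'")
```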
-----------------------
Here I decided to do some real world tests, and to try the building of an output variable instead of changing the data in place. For this I reduced the data to the first 5000 records.
To start, a little cleanup:
Code: Select all
put fld "csv_fld_org" into MyVar
put the milliseconds into MyStart
replace "\N" with quote & quote in MyVar -- replace the Null values
replace quote & return & quote with return in MyVar -- strip the data
delete the first char of MyVar
delete the last char of MyVar
replace quote & ";" & quote with numtochar(17) in MyVar -- get nice itemdel
put the milliseconds - MyStart & " ms;"
beep
put MyVar into fld "csv_fld"
6.7.10: 53 ms;
8.0: 42 ms;
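The cleanup pass above is pure whole-string replaces, which is why it stays in the tens of milliseconds. A rough Python equivalent of the same steps (the control character choice mirrors numtochar(17); sample data and the helper name are mine):

```python
DELIM = "\x11"  # same trick as numtochar(17): a char that never occurs in the data

def clean(text):
    s = text.replace("\\N", '""')   # SQL NULL markers -> empty quoted field
    s = s.replace('"\n"', "\n")     # drop quotes at record boundaries
    s = s.strip('"')                # and at the very start/end
    return s.replace('";"', DELIM)  # quoted semicolons become the item delimiter
```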
Now that we have clean data, let's go:
At first changing in place:
Code: Select all
put fld "csv_fld" into MyVar
put 0*1 into MyCounter
put the milliseconds into MyStart
set the itemdel to numtochar(17)
repeat with i = 1 to the number of lines in MyVar -- go through the records & mod the fields:
   replace "," with "." in item 10 of line i of MyVar -- a German currency field 1,00 -> 1.00
   put killSpaces(item 5 of line i of MyVar) into item 5 of line i of MyVar -- get rid of unused spaces
   put item 12 of line i of MyVar into MyDate -- date conversion: 2014-10-02 -> 10/02/14
   set the itemdel to "-"
   put item 2 of MyDate & "/" & item 3 of MyDate & "/" & char -2 to -1 of item 1 of MyDate into MyDate
   set the itemdel to numtochar(17)
   put MyDate into item 12 of line i of MyVar
   add 1 to MyCounter
end repeat
replace numtochar(17) with ";" in MyVar -- set back, maybe you want to check it in a spreadsheet?
put MyCounter & " replacements / " & the milliseconds - MyStart & " ms;"
beep
put MyVar into fld "csv_fld"
6.7.10: 5000 replacements / 23500 ms;
8.0: 5000 replacements / 24400 ms;
Btw., the function to kill the spaces:
Code: Select all
function killSpaces MyStr
   repeat until offset("  ", MyStr) = 0
      replace "  " with " " in MyStr
   end repeat
   return MyStr
end killSpaces
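The same space collapsing can be done in a single pass with a regular expression; a Python sketch:

```python
import re

# Collapse any run of two or more spaces to a single space,
# in one pass instead of a loop of pairwise replacements.
def kill_spaces(s):
    return re.sub(r" {2,}", " ", s)
```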
But now with building an output variable:
Code: Select all
put fld "csv_fld" into MyData
put 0*1 into MyCounter
put the milliseconds into MyStart
set the itemdel to numtochar(17)
repeat with i = 1 to the number of lines in MyData -- go through the records & mod the fields:
   put line i of MyData into MyVar -- only working on a copy now!
   replace "," with "." in item 10 of MyVar -- a German currency field 1,00 -> 1.00
   put killSpaces(item 5 of MyVar) into item 5 of MyVar -- get rid of unused spaces
   put item 12 of MyVar into MyDate -- date conversion: 2014-10-02 -> 10/02/14
   set the itemdel to "-"
   put item 2 of MyDate & "/" & item 3 of MyDate & "/" & char -2 to -1 of item 1 of MyDate into MyDate
   set the itemdel to numtochar(17)
   put MyDate into item 12 of MyVar
   add 1 to MyCounter
   -- put MyVar into line i of MyOutput -- SLOWER!
   put MyVar & return after MyOutput
end repeat
replace numtochar(17) with ";" in MyOutput -- set back, maybe you want to check it in a spreadsheet?
put MyCounter & " replacements / " & the milliseconds - MyStart & " ms;"
beep
put MyOutput into fld "csv_fld"
6.7.10:
put MyVar & return after MyOutput: 5000 replacements / 2960 ms;
put MyVar into line i of MyOutput: 5000 replacements / 10040 ms;
8.0:
put MyVar & return after MyOutput: 5000 replacements / 3980 ms;
put MyVar into line i of MyOutput: 5000 replacements / 10250 ms;
Now this is interesting! Not only that building an output var is so much faster, the difference in how to do it, too!
I learned quite a bit here. Thus I wrote this, maybe it will help others, too.
Seems my old HC habits will have to change, LC is too much different - and should I ever switch to LC 8, I'll have to change again - Arrgghh!
---------------------------
Anyways, re-reading the OP I try now with larger files: 17MB, 4x my initial table.
I add a line "put i" to see the progress, and retry the fastest variable again, with 5000 records first:
6.7.10: 5000 replacements / 9878 ms; 1.9756 ms/record
(3980 ms w/o the "put i")
And now the big file:
Clean-up: 720 ms;
6.7.10: 67648 replacements / 813392 ms; 12.0238 ms/record
Noticeable, indeed. How to circumvent this slowdown?
FourthWorld wrote:
I can't help but wonder if there's a way to do replacements across the entire data set without using any loops at all.

Hmmm. Pointer arithmetic? Must make stop. Must think.
Have fun!
All code published by me here was created with Community Editions of LC (thus is GPLv3).
If you use it in closed source projects, or for the Apple AppStore, or with XCode
you'll violate some license terms - read your relevant EULAs & Licenses!
Re: Performance badly degrading with growing string variable
Hi,
I knew it. Thinking helps sometimes. New code:
Code: Select all
answer file "Which file to process?"
put it into myFile
open file myFile
read from file myFile until eof
put it into myVar
close file myFile
ask file "Where to save?"
put it into myOutput
open file myOutput for append
put 0*1 into myCounter
put the milliseconds into myStart
set the itemdel to numtochar(17)
repeat for each line myLine in myVar
   replace "," with "." in item 10 of myLine -- a German currency field 1,00 -> 1.00
   put killSpaces(item 5 of myLine) into item 5 of myLine -- get rid of unused spaces
   put item 12 of myLine into myDate -- date conversion: 2014-10-02 -> 10/02/14
   set the itemdel to "-"
   put item 2 of myDate & "/" & item 3 of myDate & "/" & char -2 to -1 of item 1 of myDate into myDate
   set the itemdel to numtochar(17)
   put myDate into item 12 of myLine
   replace numtochar(17) with ";" in myLine -- set back, maybe you want to check it in a spreadsheet?
   write myLine & return to file myOutput
   add 1 to myCounter
   put myCounter
end repeat
close file myOutput
put myCounter & " replacements / " & the milliseconds - myStart & " ms; " & (the milliseconds - myStart)/myCounter & " ms/record"
beep
Old times from last post:
5000 replacements / 9878 ms; 1.9756 ms/record
67648 replacements / 813392 ms; 12.0238 ms/record
New times, with direct output to a file:
5000 replacements / 18351 ms; 3.6702 ms/record
67648 replacements / 250805 ms; 3.707501 ms/record
Voilà! Degradation removed! \o/ \o/ \o/
Even if we're slower with small data sets, the speed stays the same with a more serious workload. And we only need a small bit of RAM (below 100 MB here, Win 10-64, LC 6.7.10).
Now testing with 8.0:
5000 replacements / 23386 ms; 4.6772 ms/record
67648 replacements / 315715 ms; 4.667026 ms/record
Again consistent results.
Btw., HD speed doesn't matter. I tested both on SSD and my slowest HD, no difference. You just should restart LC between tests, it seems to do some caching:
5000 replacements / 5945 ms; 1.189 ms/record instead of the before:
5000 replacements / 18351 ms; 3.6702 ms/record ...
Without looping - didn't find a way. As long as I have such kind of work to do, with fields & records. Another idea - arrays? Maybe some brave array-wiz wants to try it?
Have fun!
All code published by me here was created with Community Editions of LC (thus is GPLv3).
If you use it in closed source projects, or for the Apple AppStore, or with XCode
you'll violate some license terms - read your relevant EULAs & Licenses!
Re: Performance badly degrading with growing string variable
Code: Select all
put myCounter
end repeat
I had a very quick look at your code...
First, comment your "put myCounter" and *THIS* will speed up your loop.
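Thierry's point generalizes: update progress every N records, not on every record, since per-record UI updates can dominate a tight loop. A small Python sketch of the idea (names are illustrative):

```python
# Throttled progress reporting: call report() only every
# report_every records instead of once per record.
def process(records, work, report_every=5000, report=print):
    for i, rec in enumerate(records, 1):
        work(rec)
        if i % report_every == 0:
            report(i)
```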
Second, I think there is a slight error in your date conversion (item 12):
you need to unquote the date and quote it again.
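The quote problem Thierry means: item 12 still carries its quotes when the date is rebuilt, so the closing quote lands inside the result (e.g. 10/02'/14 instead of '10/02/14'). A Python sketch of the unquote-convert-requote fix (function name is mine):

```python
# Strip the surrounding quote first, convert the date, then re-quote;
# converting the quoted value directly misplaces the closing quote.
def convert_date(field):
    q = field[0] if field[:1] in ("'", '"') else ""
    core = field.strip("'\"")
    y, m, d = core.split("-")          # "2014-10-02" -> 2014, 10, 02
    return q + m + "/" + d + "/" + y[-2:] + q
```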
HTH,
Thierry
!
SUNNY-TDZ.COM doesn't belong to me since 2021.
To contact me, use the Private messages. Merci.
!