Performance badly degrading with growing string variable
With LC 8.0 Indy (and older versions)
I often have to change values in large (>12 MB) csv files.
Normally I put the file into an input variable,
repeat with each Line some conversion,
put the changed line after an output variable,
finally write this output variable back to the file.
This works just fine, but processing time skyrockets with the number of added output lines:
A simple test routine verifies this:
On my machine adding the first 10000 lines to a string takes about 13 ms, lines 200000 to 210000 take about half a second, and lines 490000 to 500000 take nearly 1.5 sec.
With some other processing this adds up (in the actual stack, to a running time of nearly half an hour).
I hope someone has an easy way around this problem.
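This is the classic quadratic-append pattern: each append to a very long string can copy everything built so far, so total cost grows with the square of the output length. The effect, and the usual fix of collecting pieces and joining once, can be sketched outside LiveCode in Python (names are illustrative):

```python
# Naive: repeatedly concatenating onto one growing string.
# Each step may copy everything built so far -> O(n^2) total work.
def build_naive(lines):
    out = ""
    for line in lines:
        out = out + line + "\n"
    return out

# Fix: collect the pieces and join once at the end -> O(n) total work.
def build_joined(lines):
    return "".join(line + "\n" for line in lines)
```

Both produce identical output; only the cost differs as the output grows.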
Re: Performance badly degrading with growing string variable
Is this timing different than with v7?
And can you post your code? We've had very good luck working together here to optimize things.
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn
Re: Performance badly degrading with growing string variable
Performance seems to be slightly better in 8.
A test stack would have a scrolling field named "f_show"
and a button to run the test with the following script
Code: Select all
on mouseUp
   put 10000 into tChunkSize
   put 300000 into tMaxIterations
   put 0 into tCounter
   put 0 into tTeller
   put milliseconds() into tNow
   put "Milliseconds used for " & tChunkSize & " iterations" & cr into field "f_Show"
   repeat with i = 1 to tMaxIterations
      put "The quick brown fox jumps over the lazy dog" & cr after tBigData
      add 1 to tCounter
      if tCounter > tChunkSize then
         add tChunkSize to tTeller
         put (milliseconds() - tNow) & ": " & tTeller & cr after field "f_Show"
         put milliseconds() into tNow
         put 0 into tCounter
      end if
   end repeat
   put (milliseconds() - tNow) & ": " & i after field "f_Show"
end mouseUp
Re: Performance badly degrading with growing string variable
In repeat operations you'll normally see a rather huge speed boost by first taking the data out of a field object, working on it in a variable within the loop, and then putting it into the field for display when you're done.
In your case, it seems you're logging time increments as you go. That could still be done in a variable, though - do you need to see that output as it goes?
Also, clearly that's sample data, since the algo would seem to have no value other than to measure itself. To help find the best solution for your needs it would be helpful to better understand the problem you're solving. What do you need to do?
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn
Re: Performance badly degrading with growing string variable
Sorry, I may not have been clear enough:
The job requires a number of csv-files (about 80 MB) to be put into an SqLite database.
This means translating the data to SQL and doing some conversions on the fly, like exchanging commas for dots, since the data comes from a German system that uses the comma as decimal point. Other changes have to happen too, so the whole script gets rather complicated.
It takes a long time to run and I could pin this down to the string growing.
As you noticed, my example here does nothing but add one line to the string, over and over.
The time measuring is only for this testing situation and does not significantly affect the timing problem.
But when you run it you can really see processing time going up terribly - weird, and I wonder ...
Re: Performance badly degrading with growing string variable
CSV is a notoriously FUBAR format. Fortunately one of our community members, Alex Tweedly, came up with a nice algo that parses it with reasonably good efficiency into a format that lends itself much better to working with - see the middle of this page:
http://www.fourthworld.com/embassy/arti ... t-die.html
That handler returns the data in a format where fields are separated by tabs and lines by CR, and by default it replaces in-data CRs with ASCII 11 (though there are params to use a different character if needed).
To import your data into SQLite I'd suggest:
1. Get the list of files
2. For each, read it in,
2.1 Pass it through CsvToTab to get a format that's efficient to work with
2.2 Make adjustments to that format for import as needed for SQL
2.3 Do the insert between OPEN TRANSACTION and CLOSE TRANSACTION SQL statements
Tips to keep things efficient:
1. Constantly adding data to a field as shown doesn't do anything for the processing of the files and only slows things down progressively, as you've seen, since the data gets longer and longer.
2. Showing progress can be useful, but any indication of progress takes time away from the data processing so you'll want to keep those to the minimum needed.
3. If each file processes quickly enough, your progress indicator should just show files, e.g.:
Code: Select all
put "Processing file "& i &" of "& tNumberOfFiles into fld "Progress"
4. Where possible, use "repeat for each line..." rather than "repeat with i =..." for handling loops, as the former does a smarter job of keeping track of where it is and parsing as it goes, while the latter needs to start at the beginning of the chunk in each iteration and count i number of lines to get to the one to be worked on.
Hopefully that outline and those tips will speed things up for you by at least an order of magnitude. If not please consider posting the full code here so we can help optimize it for you.
This thread is long but perhaps a good read, a discussion of an algo that originally took 9 minutes ultimately brought down to just a few millisecs, and along the way explaining each of the alterations to the algo to achieve those speeds:
http://forums.livecode.com/viewtopic.php?f=8&t=24945
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn
Re: Performance badly degrading with growing string variable
Thanks for all your considerations, Richard!
The actual processing algo isn't the problem, as it includes TRANSACTION, repeat for each …
It really comes down to "put something after some string" taking terribly long with bigger strings.
In this test stack, writing 500000 dummy lines costs almost 36 seconds.
I finally got around this by doing it in smaller chunks:
append data to a string tBigDataPart, and after a bunch of lines append tBigDataPart to tBigData. This reduces the total time of this test drastically:
with chunks of 1000 Lines to 3.5 seconds
with chunks of 10000 Lines to 0.8 seconds
with chunks of 20000 Lines to 1.2 seconds
The optimum number varies with the length of the lines to add.
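The chunked workaround, sketched outside LiveCode in Python for comparison (chunk size is the tuning knob described above; names are mine):

```python
# Accumulate lines in a small buffer and flush it into the big result
# periodically, so most concatenations stay short.
def build_chunked(lines, chunk_size=10000):
    big_parts = []   # flushed chunks
    buffer = []      # current small chunk
    for line in lines:
        buffer.append(line)
        if len(buffer) >= chunk_size:
            big_parts.append("\n".join(buffer))
            buffer.clear()
    if buffer:
        big_parts.append("\n".join(buffer))
    return "\n".join(big_parts)
```

The result is identical to naive appending; only the amount of copying changes.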
Re: Performance badly degrading with growing string variable
Hmmm.

Havanna wrote:
Normally I put the file into an input variable,
repeat with each Line some conversion,
put the changed line after an output variable,
finally write this output variable back to the file.

This looks very slow. Why do you do it line by line? Why don't you just replace in the input variable and use this then for output? Why not:

Code: Select all
repeat with i = 1 to Length(MyVar)
   put char i of MyVar into MyChar
   if MyChar = "," then put "." into char i of MyVar
   else if MyChar = Quote then put "'" into char i of MyVar
   else if -- whatever else ...
end repeat
return MyVar

For sure, you could also use:

Code: Select all
repeat for each char myChar in MyVar

Should be worlds faster. But maybe I just misunderstood this all ;-)))

Have fun!

Addendum to Richard:

FourthWorld wrote:
CSV is a notoriously FUBAR format.

CSV here is obviously not meant as "Comma-Separated-Values" as you like to call it.

Havanna wrote:
[...] like exchanging commas for dots as data comes from a german system that would use the comma as a decimal point. [...]

Not even we Germans would be nasty enough to use our decimal separator as field separator, too - so we basically use "SSV" instead ("Semicolon-Separated-Values"). ;-)

But since nobody else would know what this may be (.ssv?), we just use "CSV" in the more useful common meaning of "Character-Separated-Values".
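For what it's worth, most general-purpose CSV parsers take the delimiter as a parameter, so the semicolon flavor needs no special handling; e.g. with Python's csv module (sample records invented):

```python
import csv
import io

# Quoted, semicolon-delimited records, German decimal commas.
sample = '"172";"30";"3,8500"\n"171";"28";"7,6500"\n'

# The parser strips the quotes and splits on ";" for us.
rows = list(csv.reader(io.StringIO(sample), delimiter=";"))
for row in rows:
    row[2] = row[2].replace(",", ".")  # decimal comma -> point
```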
All code published by me here was created with Community Editions of LC (thus is GPLv3).
If you use it in closed source projects, or for the Apple AppStore, or with XCode
you'll violate some license terms - read your relevant EULAs & Licenses!
Re: Performance badly degrading with growing string variable
AxWald wrote:
Addendum to Richard:
FourthWorld wrote:
CSV is a notoriously FUBAR format.
CSV here is obviously not meant as "Comma-Separated-Values" as you like to call it.

It wasn't obvious to me. Commas are one of the most commonly used delimiters, if not the most common; the data being worked on was not shown, and neither "semi-colon" nor ";" appears in this thread prior to your post. How were you able to discern that the data being worked on uses semi-colons as its field delimiter?

AxWald wrote:
Not even we Germans would be nasty enough to use our decimal separator as field separator, too - so we basically use "SSV" instead ("Semicolon-Separated-Values").
But since nobody else would know what this may be (.ssv?), we just use "CSV" in the more useful common meaning of "Character-Separated-Values".

Regardless of the specific delimiter used, the issues of escaping in-data returns and in-data delimiters, the subsequent need to escape the escapes whenever printable characters are used for escapes, and the need to handle the common (though, confoundingly, not universal) convention of quoting non-numeric values are shared by all delimited tabular data. While commas are perhaps the poorest choice, any delimited data will require handling these considerations in a generalized parser.
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn
Re: Performance badly degrading with growing string variable
AxWald wrote:
Hmmm. This looks very slow. Why do you do it line by line? Why not:
Code: Select all
repeat with i = 1 to Length(MyVar)
   put char i of MyVar into MyChar
   if MyChar = "," then put "." into char i of MyVar
   else if MyChar = Quote then put "'" into char i of MyVar
   else if -- whatever else ...
end repeat
return MyVar
For sure, you could also use:
Code: Select all
repeat for each char myChar in MyVar
Should be worlds faster.

The "repeat with" structure is almost always slower since it has to count from the beginning of the string every time through the loop. "Repeat for" is much faster, but in this case you'd still need to build a new list, unless you also want to keep an index counter (which would slow it down because it's counting characters again). Benchmarks on similar handlers in the past have usually shown that it is faster to build a new list than to alter an existing one, and I generally use the rebuild method myself. But it would be interesting to see if that holds true in this case.
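Jacque's build-a-new-list advice, sketched in Python for comparison (where strings are immutable, so the rebuild style is the natural one anyway; names are illustrative):

```python
# Walk the input once and append converted characters to a new
# sequence - analogous to "repeat for each char" plus a rebuild.
def rebuild(text):
    out = []
    for ch in text:
        if ch == ",":
            out.append(".")
        elif ch == '"':
            out.append("'")
        else:
            out.append(ch)
    return "".join(out)
```

Each character is visited exactly once, with no per-iteration scan back to the start of the string.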
Jacqueline Landman Gay | jacque at hyperactivesw dot com
HyperActive Software | http://www.hyperactivesw.com
Re: Performance badly degrading with growing string variable
Hi,
FourthWorld wrote:
How were you able to discern that the data being worked on uses semi-colons as its field delimiter?

- Because Havanna said so (I quoted it), and
- from experience. ;-)

To be honest, I've never seen a CSV actually separated with commas, and it's quite some aeons ago that I saw my first. I assume that's because I'm not in the anglo-saxon world, where ppl still use prehistoric measurements and points as decimal dividers ... ;-)

jacque wrote:
The "repeat with" structure is almost always slower since it has to count from the beginning of the string every time through the loop.

Oooops - seems I mixed up LC with HC :(
But it's worth a test. Too bad we cannot check the result with MacsBug anymore ;-)
Have fun!
All code published by me here was created with Community Editions of LC (thus is GPLv3).
If you use it in closed source projects, or for the Apple AppStore, or with XCode
you'll violate some license terms - read your relevant EULAs & Licenses!
Re: Performance badly degrading with growing string variable
AxWald wrote:
FourthWorld wrote:
How were you able to discern that the data being worked on uses semi-colons as its field delimiter?
- Because Havanna said so (I quoted it), and
- from experience.
To be honest, I've never seen a CSV actually separated with commas, and it's quite some aeons ago that I saw my first. I assume that's because I'm not in the anglo-saxon world, where ppl still use prehistoric measurements and points as decimal dividers ...

Given that she never mentions semi-colons, and wrote only of numeric separators and not item delimiters, that's an amazing bit of inference.
@Havanna: it may be helpful if you're in a position to provide a link to the data and an example of the desired output. I can't help but wonder if there's a way to do replacements across the entire data set without using any loops at all.
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn
Re: Performance badly degrading with growing string variable
Hi,
Jacque's remark about speed gave me headaches, so I had to test.
At first I exported a table from a database, using a predefined format in HeidiSQL named "Excel CSV": Semicolon as field delimiter, Return as record delimiter, Quotes around all fields. That's what every spreadsheet can read. The data:
Code: Select all
File size: 4.497.398 Bytes; That's 16.912 Records à 38 Fields
Example (3 records):
"172";"30";"55";"1";"A product name xyz";"";"";"0000055";"6";"3,8500";"";"2014-10-02";"1";"2";"angelegt";"";"";"2014-10-05 19:14:29";"0,00";"0,00";"0,00";"0";"0";"0,00";\N;"0,00";"0,00";"0,00";"0,00";"0,00";"";"0,00";"0";"0";"0";"";"0";"0"
"171";"30";"54";"1";"Another product name abc";"";"";"0000054";"6";"7,6500";"";"2014-10-02";"1";"1";"angelegt";"";"";"2014-10-05 19:14:02";"0,00";"0,00";"0,00";"0";"0";"0,00";\N;"0,00";"0,00";"0,00";"0,00";"0,00";"";"0,00";"0";"0";"0";"";"0";"0"
"166";"28";"254";"1";"Still a product name 123";"";"";"0000254";"144";"1,2500";"";"2014-10-04";"1";"7";"angelegt";"ermaessigt";"";"2014-10-05 18:57:34";"0,00";"0,00";"0,00";"0";"0";"0,00";\N;"0,00";"0,00";"0,00";"0,00";"0,00";"";"0,00";"0";"0";"0";"";"0";"0"
(I added a Return here between the records for better distinguishing. And I don't know why there's 4 Spaces in the name field, but such happens ...)
At first, I try some simple stuff - it's German data, so the decimal divider is Comma. Let's make it a Point. And while we're at it, let's replace Quotes with Single Quotes:
The traditional way using "repeat with i", HC style.
Code: Select all
put fld "csv_fld_org" into MyVar
put 0*1 into MyCounter
put the milliseconds into MyStart
repeat with i = 1 to len(MyVar)
   put char i of MyVar into MyChar
   if MyChar = comma then
      put "." into char i of MyVar
      add 1 to MyCounter
   else if MyChar = quote then
      put "'" into char i of MyVar
      add 1 to MyCounter
   end if
end repeat
put MyCounter & " replacements / " & the milliseconds - MyStart & " ms;"
beep
put MyVar into fld "csv_fld"
6.7.10: 1425178 replacements / 2960 ms;
8.0: 1425178 replacements / 13400 ms;
(all time values are rounded, the last non-zero digit can vary a bit)
Now "repeat for each":
Code: Select all
put fld "csv_fld_org" into MyVar
put 0*1 into MyCounter
put the milliseconds into MyStart
repeat for each char MyChar in MyVar
   if MyChar = comma then
      put "." into MyChar
      add 1 to MyCounter
   else if MyChar = quote then
      put "'" into MyChar
      add 1 to MyCounter
   end if
end repeat
put MyCounter & " replacements / " & the milliseconds - MyStart & " ms;"
beep
put MyVar into fld "csv_fld"
6.7.10: 1425178 replacements / 1970 ms;
8.0: 1425178 replacements / 10100 ms;
Slightly faster. Just - this doesn't change MyVar! It only plays with its own variables ;-) Left as an exercise to the reader to find out why ...
Btw, see the numbers for 8.0! Ouch!
But now I want speed:
Code: Select all
put fld "csv_fld_org" into MyVar
put the milliseconds into MyStart
replace comma with "." in MyVar
replace quote with "'" in MyVar
put the milliseconds - MyStart & " ms;"
beep
put MyVar into fld "csv_fld"
6.7.10: 48 ms;
8.0: 32 ms;
This is the way to go! And here 8.0 looks quite well, too.
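The whole-string replace approach maps directly to other languages; a Python sketch of the same two substitutions, as a single pass per replacement with no explicit loop:

```python
# Whole-string replacement: the runtime scans the string once per
# substitution instead of doing per-character bookkeeping in a loop.
def convert(text):
    return text.replace(",", ".").replace('"', "'")
```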
-----------------------
Here I decided to do some real world tests, and to try the building of an output variable instead of changing the data in place. For this I reduced the data to the first 5000 records.
To start, a little cleanup:
Code: Select all
put fld "csv_fld_org" into MyVar
put the milliseconds into MyStart
replace "\N" with quote & quote in MyVar -- replace the Null values
replace quote & return & quote with return in MyVar -- strip the data
delete the first char of MyVar
delete the last char of MyVar
replace quote & ";" & quote with numtochar(17) in MyVar -- get nice itemdel
put the milliseconds - MyStart & " ms;"
beep
put MyVar into fld "csv_fld"
6.7.10: 53 ms;
8.0: 42 ms;
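The cleanup pass above is pure whole-string replaces, which is why it stays in the tens of milliseconds. A rough Python equivalent of the same steps (the control character choice mirrors numtochar(17); sample data and the helper name are mine):

```python
DELIM = "\x11"  # same trick as numtochar(17): a char that never occurs in the data

def clean(text):
    s = text.replace("\\N", '""')   # SQL NULL markers -> empty quoted field
    s = s.replace('"\n"', "\n")     # drop quotes at record boundaries
    s = s.strip('"')                # and at the very start/end
    return s.replace('";"', DELIM)  # quoted semicolons become the item delimiter
```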
Now that we have clean data, let's go:
At first changing in place:
Code: Select all
put fld "csv_fld" into MyVar
put 0*1 into MyCounter
put the milliseconds into MyStart
set the itemdel to numtochar(17)
repeat with i = 1 to the number of lines in MyVar -- go through the records & mod the fields:
   replace "," with "." in item 10 of line i of MyVar -- a German currency field 1,00 -> 1.00
   put killSpaces(item 5 of line i of MyVar) into item 5 of line i of MyVar -- get rid of unused spaces
   put item 12 of line i of MyVar into MyDate -- date conversion: 2014-10-02 -> 10/02/14
   set the itemdel to "-"
   put item 2 of MyDate & "/" & item 3 of MyDate & "/" & char -2 to -1 of item 1 of MyDate into MyDate
   set the itemdel to numtochar(17)
   put MyDate into item 12 of line i of MyVar
   add 1 to MyCounter
end repeat
replace numtochar(17) with ";" in MyVar -- set back, maybe you want to check it in a spreadsheet?
put MyCounter & " replacements / " & the milliseconds - MyStart & " ms;"
beep
put MyVar into fld "csv_fld"
6.7.10: 5000 replacements / 23500 ms;
8.0: 5000 replacements / 24400 ms;
Btw., the function to kill the spaces:
Code: Select all
function killSpaces MyStr
   repeat until offset("  ", MyStr) = 0
      replace "  " with " " in MyStr
   end repeat
   return MyStr
end killSpaces
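The same space collapsing can be done in a single pass with a regular expression; a Python sketch:

```python
import re

# Collapse any run of two or more spaces to a single space,
# in one pass instead of a loop of pairwise replacements.
def kill_spaces(s):
    return re.sub(r" {2,}", " ", s)
```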
But now with building an output variable:
Code: Select all
put fld "csv_fld" into MyData
put 0*1 into MyCounter
put the milliseconds into MyStart
set the itemdel to numtochar(17)
repeat with i = 1 to the number of lines in MyData -- go through the records & mod the fields:
   put line i of MyData into MyVar -- only working on a copy now!
   replace "," with "." in item 10 of MyVar -- a German currency field 1,00 -> 1.00
   put killSpaces(item 5 of MyVar) into item 5 of MyVar -- get rid of unused spaces
   put item 12 of MyVar into MyDate -- date conversion: 2014-10-02 -> 10/02/14
   set the itemdel to "-"
   put item 2 of MyDate & "/" & item 3 of MyDate & "/" & char -2 to -1 of item 1 of MyDate into MyDate
   set the itemdel to numtochar(17)
   put MyDate into item 12 of MyVar
   add 1 to MyCounter
   -- put MyVar into line i of MyOutput -- SLOWER!
   put MyVar & return after MyOutput
end repeat
replace numtochar(17) with ";" in MyOutput -- set back, maybe you want to check it in a spreadsheet?
put MyCounter & " replacements / " & the milliseconds - MyStart & " ms;"
beep
put MyOutput into fld "csv_fld"
6.7.10:
put MyVar & return after MyOutput: 5000 replacements / 2960 ms;
put MyVar into line i of MyOutput: 5000 replacements / 10040 ms;
8.0:
put MyVar & return after MyOutput: 5000 replacements / 3980 ms;
put MyVar into line i of MyOutput: 5000 replacements / 10250 ms;
Now this is interesting! Not only that building an output var is so much faster, the difference in how to do it, too!
I learned quite a bit here. Thus I wrote this, maybe it will help others, too.
Seems my old HC habits will have to change, LC is too much different - and should I ever switch to LC 8, I'll have to change again - Arrgghh!
---------------------------
Anyways, re-reading the OP I try now with larger files: 17MB, 4x my initial table.
I add a line "put i" to see the progress, and retry the fastest variable again, with 5000 records first:
6.7.10: 5000 replacements / 9878 ms; 1.9756 ms/record
(3980 ms w/o the "put i")
And now the big file:
Clean-up: 720 ms;
6.7.10: 67648 replacements / 813392 ms; 12.0238 ms/record
Noticeable, indeed. How to circumvent this slowdown?
FourthWorld wrote:
I can't help but wonder if there's a way to do replacements across the entire data set without using any loops at all.

Hmmm. Pointer arithmetic? Must make stop. Must think.
Have fun!
All code published by me here was created with Community Editions of LC (thus is GPLv3).
If you use it in closed source projects, or for the Apple AppStore, or with XCode
you'll violate some license terms - read your relevant EULAs & Licenses!
Re: Performance badly degrading with growing string variable
Hi,
I knew it. Thinking helps sometimes. New code:
Code: Select all
answer file "Which file to process?"
put it into myFile
open file myFile
read from file myFile until eof
put it into myVar
close file myFile
ask file "Where to save?"
put it into myOutput
open file myOutput for append
put 0*1 into myCounter
put the milliseconds into myStart
set the itemdel to numtochar(17)
repeat for each line myLine in myVar
   replace "," with "." in item 10 of myLine -- a German currency field 1,00 -> 1.00
   put killSpaces(item 5 of myLine) into item 5 of myLine -- get rid of unused spaces
   put item 12 of myLine into myDate -- date conversion: 2014-10-02 -> 10/02/14
   set the itemdel to "-"
   put item 2 of myDate & "/" & item 3 of myDate & "/" & char -2 to -1 of item 1 of myDate into myDate
   set the itemdel to numtochar(17)
   put myDate into item 12 of myLine
   replace numtochar(17) with ";" in myLine -- set back, maybe you want to check it in a spreadsheet?
   write myLine & return to file myOutput
   add 1 to myCounter
   put myCounter
end repeat
close file myOutput
put myCounter & " replacements / " & the milliseconds - myStart & " ms; " & (the milliseconds - myStart)/myCounter & " ms/record"
beep
Old times from last post:
5000 replacements / 9878 ms; 1.9756 ms/record
67648 replacements / 813392 ms; 12.0238 ms/record
New times, with direct output to a file:
5000 replacements / 18351 ms; 3.6702 ms/record
67648 replacements / 250805 ms; 3.707501 ms/record
Voilà! Degradation removed! \o/ \o/ \o/
Even if we're slower with small data sets, the speed stays the same with a more serious workload. And we only need a small bit of RAM (below 100 MB here, Win 10-64, LC 6.7.10).
Now testing with 8.0:
5000 replacements / 23386 ms; 4.6772 ms/record
67648 replacements / 315715 ms; 4.667026 ms/record
Again consistent results.
Btw., HD speed doesn't matter. I tested both on SSD and my slowest HD, no difference. You just should restart LC between tests, it seems to do some caching:
5000 replacements / 5945 ms; 1.189 ms/record instead of the before:
5000 replacements / 18351 ms; 3.6702 ms/record ...
Without looping - didn't find a way. As long as I have such kind of work to do, with fields & records. Another idea - arrays? Maybe some brave array-wiz wants to try it?
Have fun!
All code published by me here was created with Community Editions of LC (thus is GPLv3).
If you use it in closed source projects, or for the Apple AppStore, or with XCode
you'll violate some license terms - read your relevant EULAs & Licenses!
Re: Performance badly degrading with growing string variable
Code: Select all
put myCounter
end repeat
I had a very quick look at your code...
First, comment your "put myCounter" and *THIS* will speed up your loop.
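Thierry's point generalizes: update progress every N records, not on every record, since per-record UI updates can dominate a tight loop. A small Python sketch of the idea (names are illustrative):

```python
# Throttled progress reporting: call report() only every
# report_every records instead of once per record.
def process(records, work, report_every=5000, report=print):
    for i, rec in enumerate(records, 1):
        work(rec)
        if i % report_every == 0:
            report(i)
```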
Second, I think there is a slight error in your date conversion (item 12):
you need to unquote the date and quote it again.
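The quote problem Thierry means: item 12 still carries its quotes when the date is rebuilt, so the closing quote lands inside the result (e.g. 10/02'/14 instead of '10/02/14'). A Python sketch of the unquote-convert-requote fix (function name is mine):

```python
# Strip the surrounding quote first, convert the date, then re-quote;
# converting the quoted value directly misplaces the closing quote.
def convert_date(field):
    q = field[0] if field[:1] in ("'", '"') else ""
    core = field.strip("'\"")
    y, m, d = core.split("-")          # "2014-10-02" -> 2014, 10, 02
    return q + m + "/" + d + "/" + y[-2:] + q
```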
HTH,
Thierry
!
SUNNY-TDZ.COM doesn't belong to me since 2021.
To contact me, use the Private messages. Merci.
!