Fixed length to delimited - is there a faster way?

edgore
VIP Livecode Opensource Backer
Posts: 197
Joined: Wed Jun 14, 2006 8:40 pm

Fixed length to delimited - is there a faster way?

Post by edgore » Tue Sep 06, 2011 7:09 pm

I have a very large (tens of thousands of lines) text file generated by a mainframe reporting system. The data is fixed length, so each element (11 per line) is separated by spaces, and the number of spaces between elements varies on each line, depending on the length of each element.

Right now I am going through line by line and parsing the data by inserting delimiters into each line based on the fixed lengths of each element, then handing each item off to a function to remove trailing spaces (some data elements contain spaces, so I can't just replace them with empty in the container).

It works, but it's much slower than I would like. I am wondering if there is any clever way to avoid handling each line individually when inserting the delimiters. Is there some trick I could use, for example, to just say by fiat that for every line of a container chars 1-10 are item 1, chars 11-14 are item 2, etc.? I couldn't think of one.

Thanks!

sturgis
Livecode Opensource Backer
Posts: 1685
Joined: Sat Feb 28, 2009 11:49 pm

Re: Fixed length to delimited - is there a faster way?

Post by sturgis » Tue Sep 06, 2011 8:02 pm

Just tried the following. It takes about 3500 milliseconds to process over 800k lines. I tried doing the replaceText outside the loop on the whole 800k lines, but that adds a full second to the processing. I also tried to skip replaceText entirely by using funky chunking such as
"put word 1 to -1 of (char 1 to 10 of tLine) & comma & word 1 to -1 of (char 11 to 20 of tLine) & comma & word 1 to -1 of (char 21 to -1 of tLine)". While it worked, it ended up being horrendously slow (38 seconds or so for the same data).

Code:

on mouseUp
   put the milliseconds into tStart -- track time to process
   put field 1 into tDat -- copy the data out of my field into a working var

   -- just a string to output at the end to the message box
   put "It took [[the milliseconds - tStart]] milliseconds to process [[the number of lines in tDat]] lines of data" into tResultLine

   -- repeat through the data using the "for each" form (it's faster than other options)
   repeat for each line tLine in tDat

      -- I only used 3 columns. There might be a faster way to break this up, not sure.
      -- Using comma as the delimiter.
      put (char 1 to 10 of tLine) & comma & (char 11 to 20 of tLine) & comma & (char 21 to -1 of tLine) & return into tNewLine

      -- Process the line using regex replaceText: any number of spaces followed by a comma
      -- is replaced with just a comma. Then append the result to the tNewDat variable.
      put replaceText(tNewLine, " *,", ",") after tNewDat
   end repeat

   -- remove the trailing return from tNewDat
   delete the last char of tNewDat

   -- drop the new data into my 2nd field
   put tNewDat into field 2

   -- put the result line with the values merged in
   put merge(tResultLine)
end mouseUp
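
The same idea could be stretched to any number of columns by passing the widths in as a list. What follows is only a rough sketch under some assumptions: the function name fixedToDelimited and the parameter pWidths are made up for illustration, the widths shown in the usage line are placeholders rather than the real report layout, and it assumes none of the values contain a comma.

```livecode
-- Sketch only: fixedToDelimited and pWidths are hypothetical names.
-- pWidths is a comma-separated list of column widths, e.g. "10,10,20".
function fixedToDelimited pData, pWidths
   local tOut, tNewLine, tStart
   repeat for each line tLine in pData
      put empty into tNewLine
      put 1 into tStart
      repeat for each item tWidth in pWidths
         -- copy one fixed-width column and follow it with a comma
         put char tStart to (tStart + tWidth - 1) of tLine & comma after tNewLine
         add tWidth to tStart
      end repeat
      -- strip the padding in front of every comma (same regex as above)
      put replaceText(tNewLine, " *,", ",") into tNewLine
      -- swap the trailing comma for a line break
      put return into char -1 of tNewLine
      put tNewLine after tOut
   end repeat
   -- remove the trailing return
   delete char -1 of tOut
   return tOut
end fixedToDelimited
```

It would be called with something like "put fixedToDelimited(field 1, "10,10,20") into field 2". Because every column gets a comma before the replaceText runs, the padding on the last column is stripped too. Leading zeros or leading spaces would still need their own pass.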

Dixie
Livecode Opensource Backer
Posts: 1336
Joined: Sun Jul 12, 2009 10:53 am

Re: Fixed length to delimited - is there a faster way?

Post by Dixie » Tue Sep 06, 2011 8:20 pm

Hi..

What are the minimum and maximum possible lengths of each of the 11 possible elements on one line?

Dixie

edgore
VIP Livecode Opensource Backer
Posts: 197
Joined: Wed Jun 14, 2006 8:40 pm

Re: Fixed length to delimited - is there a faster way?

Post by edgore » Tue Sep 06, 2011 8:51 pm

I hope this makes sense...

1: either 0 or 8
2: either 0 or 7 but padded with leading zeros I have to remove
3: 0-20
4: 1-8
5: either 0 or 4
6: 4-6
7: 6 or 7
8: 0-20
9: 4-6
10: 0-21
And I lied... there are only 10 elements.

It looks like moving the data from one variable to another is much faster. I was assuming that because there was so much data I was better off saving memory and doing everything within one variable. Clearly, I was wrong. I also need to refresh myself on how the whole "1 to -1" thing works.

I will fool around with the suggestions you have made. I think it will significantly reduce the processing time.
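
For reference, here is a minimal sketch of the "1 to -1" idiom on its own. The sample value and the variable names tRaw and tTrimmed are made up for illustration. Negative chunk numbers count from the end, so "word 1 to -1" runs from the first word to the last word and leaves the padding on either side behind.

```livecode
on mouseUp
   -- tRaw is a made-up sample value with padding on both sides
   put "   New York   " into tRaw

   -- word 1 is the first word and word -1 is the last word,
   -- so this keeps everything from the start of the first word
   -- to the end of the last one, including the space in between
   put word 1 to -1 of tRaw into tTrimmed

   put "[" & tTrimmed & "]" -- shows [New York] in the message box
end mouseUp
```

The space inside the value survives, which matters here because some elements contain spaces. As noted earlier in the thread, applying this to every column inside the loop was measured as much slower than the replaceText approach, so it is probably best kept for small amounts of data.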

edgore
VIP Livecode Opensource Backer
Posts: 197
Joined: Wed Jun 14, 2006 8:40 pm

Re: Fixed length to delimited - is there a faster way?

Post by edgore » Tue Sep 06, 2011 9:51 pm

Okay, so it's down from minutes to 0.2 seconds. I think I am satisfied with that.

Thanks for your help. The problem was that, as I usually do, I erred on the side of readability over compactness: I broke statements out into more lines than needed, used terms like "the last char of" rather than "char -1", etc. I also tried to conserve memory by using "repeat with x =" and a single variable rather than "repeat for each" and a temporary variable. All of these things were working against me instead of for me.

Lessons learned.

sturgis
Livecode Opensource Backer
Posts: 1685
Joined: Sat Feb 28, 2009 11:49 pm

Re: Fixed length to delimited - is there a faster way?

Post by sturgis » Tue Sep 06, 2011 11:24 pm

Of all the things you listed, the biggest impact most likely came from using the "repeat with x = 1 to..." type of repeat loop. Unless it's been changed in one of the multitude of recent updates, it's MUCH faster to use "repeat for each".

If I recall, it's because each time the counter increments, the engine has to index from the first line of the variable down to the current line, so the loop slows down as x increases. "Repeat for each" doesn't have to do that. I think "last char of" vs. "char -1" is a non-issue.
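
A rough way to see the difference for yourself is to time both loop styles on the same data. This is just a sketch: it assumes the data is in field 1 as in the earlier script, and the variable names (tSlow, tFast, tIndexedTime, tForEachTime) are made up for the comparison.

```livecode
on mouseUp
   put field 1 into tDat -- same source field as in the earlier script

   -- indexed loop: "line x of tDat" has to be located on every pass
   put the milliseconds into tStart
   repeat with x = 1 to the number of lines in tDat
      put char 1 to 10 of line x of tDat & return after tSlow
   end repeat
   put the milliseconds - tStart into tIndexedTime

   -- "repeat for each" hands over each line in turn
   put the milliseconds into tStart
   repeat for each line tLine in tDat
      put char 1 to 10 of tLine & return after tFast
   end repeat
   put the milliseconds - tStart into tForEachTime

   put "repeat with: " & tIndexedTime & " ms, repeat for each: " & tForEachTime & " ms"
end mouseUp
```

Both loops do the same work per line, so whatever gap shows up in the message box should come from the loop style itself.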

If memory were to become an issue, I'm sure it could be worked around. The 800k+ lines I tested with probably weren't enough to worry about on a modern machine, but I have an antique around here that might have choked on it. That's 800k+ lines in my field, the same in a variable, the same at the end in my working variable, and the same amount shoved into my 2nd field.

Hmm, minor hijack here, but... if I have a temporary variable (i.e. one only used inside a handler), the memory it used is automagically freed at the end of the handler, right? Meaning there would be no benefit to manually emptying it before the handler ends?
