Get the last two columns

Klaus · Post by **Klaus** » Wed Feb 26, 2025 2:20 pm

Doesn't work, see first posting!

Unfortunately, LC doesn't get the last columns like that.

richmond62 · Post by **richmond62** » Wed Feb 26, 2025 3:18 pm

So: obviously itemDelimiter and columnDelimiter work differently.

bogs · Post by **bogs** » Wed Feb 26, 2025 3:32 pm

SparkOut wrote: ↑
Tue Feb 25, 2025 8:20 pm
Hey bogs, why would you update the field like that in the loop?
A) field updates are a hefty overhead
B) you have two statements that put data into an indexed line, which means that twice within each loop the engine has to count through to the index to update each.
Surely a more comparative test would be to put the extracted data "after" or even "before" a variable, then drop that data into the field in one go.
Notwithstanding, the array split is a brilliant method, but relies on data being consistently sane.
Anyone producing insane data for others to manipulate surely is insane, or a sadist.

The second code listing was mostly a mistaken copy/paste. However, even after I corrected it, there was little to no difference in speed. Here is the *correct* code that should have been in that post:

Code:

on mouseUp
   put empty into field 1; put empty into field 2; put empty into field 3
   --lock screen
   put the seconds into tmpStart
   
   put url("file:/home/b/Desktop/LiveCodeProjects/exampleProjects/largeCsvDatasetHandling/Electric_Vehicle_Population_Data.csv") into tData
   set the itemDelimiter to comma
   put 1 into theLine
   repeat for each line x in tData
      /* With no destination, put writes to the message box: a progress
         readout, added because the run took long enough that I thought
         the machine had locked up or the test had frozen */
      put theLine & " of " & the number of lines of tData
      put item 6 of x into line theline of tmpFld2
      put item 7 of x into line theline of tmpFld3
      add 1 to theLine
      
      wait 1 milliseconds with messages  // included this to stop maxing out the CPU on the machine, it worked a charm!
   end repeat
   
   put tmpFld2 into field 2
   put tmpFld3 into field 3
   
   --unlock screen
   put "    Putting item 6 & 7 of each line into fields 2 & 3 took " & the seconds - tmpStart & " seconds" into field 1
end mouseUp

A) field updates are a hefty overhead
Yes, if you're doing everything in the field, I would agree. However, the time difference between the repeat code above and the previous repeat code in that post was close to nil. Feel free to run it yourself and find out.

B) you have two statements that put data into an indexed line, which means that twice within each loop the engine has to count through to the index to update each.
My understanding is that with the line indexed, the engine doesn't loop anything; it goes straight to that line number. In the dictionary, under 'repeat', in the comments, oliver@runrev.com placed an example of nearly this exact same repeat structure. In the comment, they state the following:

oliver@runrev.com wrote: If you want to iterate through a large list, doing something to each item *and* keeping track of which item you are currently at, using the repeat for each form is often faster than the repeat with. For example a structure like this:
Code:
repeat with x = 1 to the number of lines of the text of field "Very Long List"
   doStuff line x of the text of field "Very Long List", x
end repeat
Could be replaced with the faster loop below:
Code:
local tLineNumber
put 1 into tLineNumber
repeat for each line tLine in field "Very Long List"
   doStuff tLine, tLineNumber
   add 1 to tLineNumber
end repeat
The reason for this is that evaluation "line x of field..." requires Revolution to loop down through the field until it finds the correct line.

That would imply (to me) that if your line is indexed, the engine goes straight to that line # skipping any loops through the variable.

C) Surely a more comparative test would be to put the extracted data "after" or even "before" a variable, then drop that data into the field in one go.
I actually tested that as well; there was no improvement over the code listed in this post. Again, though, these code snippets are tiny: modify and test them and see if you can improve them. I certainly didn't test them extensively. I was also writing them under the assumption that the array method the OP originally proposed was the best method out there; someone else may well have a faster or better one.

I hope they post it if they do !
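
For anyone who wants to try point C themselves, here is a minimal sketch of the "put after a variable" approach SparkOut describes. The two-line stand-in data and the field numbers are assumptions for illustration; the real test would read the CSV file as in the handler above.

```livecode
on mouseUp
   local tData, tCol6, tCol7
   -- stand-in data (an assumption); swap in the contents of the real CSV
   put "a,b,c,d,e,f1,g1" & return & "a,b,c,d,e,f2,g2" into tData
   set the itemDelimiter to comma
   repeat for each line tLine in tData
      -- append to the end of each variable instead of writing "into line N of ..."
      put item 6 of tLine & return after tCol6
      put item 7 of tLine & return after tCol7
   end repeat
   -- drop the trailing return left by the last append
   delete the last char of tCol6
   delete the last char of tCol7
   -- one field write per column, after the loop has finished
   put tCol6 into field 2
   put tCol7 into field 3
end mouseUp
```

Whether this beats the indexed-line version on a given machine is something to measure rather than assume, as the results reported above show.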

D) Notwithstanding, the array split is a brilliant method, but relies on data being consistently sane.
I agree, the OP was on the right track originally and it was only a misunderstanding of array keys that tripped them up.

As for the method relying on the data being consistent, ALL methods of doing this are going to rely on that consistency. I cannot think of one programmatic way or algorithm that is going to figure out where random data belongs.

Again, if anyone else knows of a way to do that, I'm all ears. I certainly do not know everything.

bogs · Post by **bogs** » Wed Feb 26, 2025 3:34 pm

richmond62 wrote: ↑
Wed Feb 26, 2025 3:18 pm
So: obviously itemDelimiter and columnDelimiter work differently.

Well, they delimit different things, certainly. There is a rowDelimiter as well and it delimits... wait for it... ROWS !!!
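
A small sketch of the difference, assuming a card with two fields and a tiny well-formed data set (both are assumptions for illustration): the itemDelimiter applies to "item" chunk expressions, while the columnDelimiter and rowDelimiter apply to the "split ... by column" form.

```livecode
on mouseUp
   local tData, tRow
   put "1,John,WA" & return & "2,Jane,OR" into tData

   -- itemDelimiter governs "item" chunk expressions
   set the itemDelimiter to comma
   put line 2 of tData into tRow
   put item 3 of tRow into field 1   -- OR

   -- columnDelimiter and rowDelimiter govern "split ... by column"
   set the columnDelimiter to comma
   set the rowDelimiter to return
   split tData by column
   -- tData is now an array keyed by column number;
   -- tData[3] holds column 3 of every row, one value per line
   put tData[3] into field 2
end mouseUp
```

Note that the keys are positive column numbers counted from the left, so rows with differing numbers of items will not line up their last columns under the same key.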

In all those posts above, I do not believe I posted the link to the data set I used. Here it is if you want like-for-like testing:
https://catalog.data.gov/dataset/electr ... ation-data

richmond62 · Post by **richmond62** » Wed Feb 26, 2025 5:14 pm

it delimits... wait for it... ROWS !!!

You made my day, my love.

bogs · Post by **bogs** » Thu Feb 27, 2025 2:54 pm

dunbarx wrote: ↑
Tue Feb 25, 2025 10:59 pm
Sparkout
<sic>
But I just ran a test with a variable with ten million lines, and the two actions take the same time.
It occurs to me that this post adds nothing to the discussion.

I disagree, I think it added considerable value, since you actually tested it instead of making assumptions, and then reported the result.

dunbarx · Post by **dunbarx** » Thu Feb 27, 2025 3:20 pm

Bogs.

What I meant is that it is a tiny side issue to the main thread.

And that thread is not so much about arrays or the speed difference between "repeat with..." and "repeat for...", but rather all about the fact that one cannot parse data that is, by its nature, unparsable. The OP will not be able to do what he wants to, and it has nothing to do with LiveCode per se.

I was wondering what sort of dataSet was so poorly formed that the "items" within each of its rows were so mangled.

Craig

bogs · Post by **bogs** » Thu Feb 27, 2025 4:31 pm

dunbarx wrote: ↑
Thu Feb 27, 2025 3:20 pm
I was wondering what sort of dataSet was so poorly formed that the "items" within each of its rows was so mangled.

You can see this problem being created whenever a program doesn't test its input (where a test for the input can be formed; where it can't, the input can still be delimited properly), or, oddly enough, accepts partially formed input, or doesn't put the input it is given into a coherent form. Some examples would be:

field input = email address { notoriously difficult to test for correct information; email addresses come in a lot of different forms and with many different domains }
field input = user skipped { a code issue: the skipped field's delimiters (in this case, the commas around it) are not written, resulting in a different number of items per line } Example: a complete line reads 1,John,WA. If John didn't enter his name, you get 1,WA because the programmer did not write code to take blanks into account; properly formed, the line would be 1,,WA.
field input = user error { user enters incorrect or poorly formed data; of course, no one alive has ever seen this happen }
delimiter used = CSV { the bane of many programmers' lives, since commas are common punctuation as well. If the input is not tested (so the comma can be replaced with something else), a poorly placed comma in an input field causes extra items in the line } Example: look at the picture in a previous post where I was discussing the poorly formed data; one line had more items than any other did.

[Attached image: download/file.php?id=22049&mode=view]
Etc. etc. etc

But I'm sure you already know or could imagine how this would happen.
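
The "different number of items per line" symptom can at least be detected before any extraction is attempted. Here is a rough sketch, not a CSV validator: the function name and its parameters are hypothetical, and it only counts comma-delimited items per line.

```livecode
-- Returns the numbers of the lines whose item count differs from pExpected,
-- one line number per line of the result (empty if every line matches).
function linesWithWrongItemCount pData, pExpected
   local tLineNum, tBad
   set the itemDelimiter to comma
   put 0 into tLineNum
   repeat for each line tLine in pData
      add 1 to tLineNum
      if the number of items of tLine is not pExpected then
         put tLineNum & return after tBad
      end if
   end repeat
   delete the last char of tBad   -- drop the trailing return
   return tBad
end linesWithWrongItemCount
```

With the skipped-name example from above, `linesWithWrongItemCount("1,John,WA" & return & "1,WA", 3)` returns 2, since the second line has only two items. A plain item count also treats a comma inside a quoted value as a delimiter, so lines containing one are flagged as well.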

dunbarx · Post by **dunbarx** » Thu Feb 27, 2025 5:50 pm

Bogs.

That is one thing I like about Excel: the data is parsed with tabs and returns automatically, assuming one has it set up properly and does not make an entry error.

So again I wonder where the OP's original data came from. It was purported to be a very large file, indicating a "grown-up" sort of effort, that is, not created by a child. But it is amazing to me that it might just as well have been.

Craig

bogs · Post by **bogs** » Thu Feb 27, 2025 6:59 pm

I suspect it did not grow to that size in a jump heh, but likely over a long LONG period of time, and possibly through several different programmers such as the OP being the latest. That it was allowed to grow to such a size tells me that no one thought to break it down into smaller subsets, which indicates the various programmers involved were probably not turning to a software engineer for guidelines.

Of course at this point, I would seriously suggest that whoever is "next in line..." do exactly that, break it down into smaller subsets that are more easily managed. No larger than say, the OP's stated 'sample' size of 300 - 500 megs, if not smaller {100 megs would take about a full second to split up using the array method the OP first proposed }.

SparkOut · Post by **SparkOut** » Thu Feb 27, 2025 7:52 pm

I believe the OP said the dataset had a variable number of items per line, but the data required to extract was always the last two items in each line - hence trying to convert to array and then hoping to obtain the last items with keys of -2 and -1.
Array keys don't work like that, but it is possible to create sanity from this dataset with a repeat for each loop.
There is just a performance overhead to that.
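
A minimal sketch of that repeat for each approach, assuming comma-delimited lines that each contain at least two items (the function name is hypothetical). Array keys cannot be negative offsets, but chunk expressions can: a negative item number counts back from the end of the line.

```livecode
-- Returns the last two items of every line of pData, one pair per line.
function lastTwoItems pData
   local tOut
   set the itemDelimiter to comma
   repeat for each line tLine in pData
      -- item -1 is the last item, item -2 the one before it
      put item -2 to -1 of tLine & return after tOut
   end repeat
   delete the last char of tOut   -- drop the trailing return
   return tOut
end lastTwoItems
```

For example, `lastTwoItems("a,b,c,d" & return & "x,y,z")` gives "c,d" on the first line and "y,z" on the second, regardless of how many items precede them.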

dunbarx · Post by **dunbarx** » Thu Feb 27, 2025 7:55 pm

Bogs.

I created a list of 100 million lines, each line containing five chars separated by tabs, about one GB.

To extract the last two items from each line into a new line in a new variable took 46 seconds.

This on an M2 Mac Mini.

Craig

dunbarx · Post by **dunbarx** » Thu Feb 27, 2025 7:59 pm

Sparkout.

Playing with arrays notwithstanding, the OP has to hope that the last two items are always ordered well enough to be the ones he needs. In other words, the broken structure of the rest of the line can be safely ignored. Likely anything that digs "deeper" into that data will fail.

Craig

SparkOut · Post by **SparkOut** » Thu Feb 27, 2025 8:10 pm

I agree. Insane datasets will either induce insanity or pain, or both

SparkOut · Post by **SparkOut** » Thu Feb 27, 2025 8:14 pm

dunbarx wrote: ↑
Thu Feb 27, 2025 7:55 pm
Bogs.

I created a list of 100 million lines, each line containing five chars separated by tabs, about one GB.

To extract the last two items from each line into a new line in a new variable took 46 seconds.

This on an M2 Mac Mini.

Craig

I suspect a lot of the delay in the testing bogs did might have been from the updates made in the message box. That would certainly create an overhead in field display updates.
The wait with messages inside the loop is a valuable way to stop the UI locking up, though.
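
One way to keep both the progress readout and the responsive UI while paying for them far less often is to do them only every so many lines. This is a sketch under assumptions: the stand-in data, the interval of 10000 lines, and the variable names are all illustrative.

```livecode
on mouseUp
   local tData, tTotal, tLineNum, tIndex
   -- stand-in data (an assumption); the real test would read the CSV file
   repeat with tIndex = 1 to 50000
      put "a,b,c" & return after tData
   end repeat
   -- count the lines once, before the loop, rather than on every pass
   put the number of lines of tData into tTotal
   put 0 into tLineNum
   repeat for each line tLine in tData
      add 1 to tLineNum
      -- per-line extraction work would go here
      if tLineNum mod 10000 is 0 then
         put tLineNum && "of" && tTotal   -- progress in the message box
         wait 0 milliseconds with messages   -- let pending events be handled
      end if
   end repeat
end mouseUp
```

The interval is a trade-off: a larger value means fewer display updates, a smaller one keeps the readout and the UI more current.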

LiveCode Forums.
