Page 2 of 2
Re: Need for Speed
Posted: Sun Sep 11, 2022 9:20 pm
by zink137
Code: Select all
Repeat with zz=the number of lines of Big down to 1 --delete duplicates
if line zz of Big=line zz-1 of Big then delete line zz-1 of Big
end Repeat
Code: Select all
put empty into zz
repeat forever --delete duplicates
add 1 to zz
if zz=the number of lines of Big then exit repeat
repeat forever
if line zz of Big=line zz+1 of Big then delete line zz of Big
if line zz of Big≠line zz+1 of Big then exit repeat
end repeat
if zz=the number of lines of Big then exit repeat
end repeat
The second code is about a magnitude faster than the first.
Re: Need for Speed
Posted: Sun Sep 11, 2022 9:58 pm
by SparkOut
Did you try the array method? Or Richmond's split/combine?
Your second code still has to count lines, which is a much greater overhead than the other suggested methods. "A magnitude faster" means reducing the time taken from hours to ... minutes?
I think you might be surprised by the other methods, maybe measuring in milliseconds. Or maybe I'll be surprised - either way, your feedback from a comparison would be very interesting.
Re: Need for Speed
Posted: Sun Sep 11, 2022 10:59 pm
by zink137
SparkOut,
The array thing is over my head and this project isn't important enough to learn it (arrays) at the present.
The second code is 10 times faster than the first--yes, hours to minutes.
I'm assuming the counting thing is always from the beginning and most of the duplicates are at the beginning, so the second code starts at the beginning instead of the end. Generating the list for 50,000 and then sorting it, puts 50,000 1's at the beginning and then 50,000 2's and so on, the number of duplicates decreasing until the end, the largest number for which there is no duplicate and by-the-way it's 121,012,864. The list for 50k is just over 5 million digits long.
What this is, is taking any positive integer and if it's odd multiply by 3 and add 1. If it's even divide by 2. And keep doing that to the result. So you get a sequence and the sequence for each integer is different. Collatz's conjecture ( I don't know how to make the link thing work, but it's at wikipedia) is that all sequences eventually converge to 1. But it's never been proven.
I'm creating a super sequence, a sequence of sequences, putting all the sequences up to 50K together, removing all the duplicates and seeing what remains. Each sequence is vaguely like a random walk, except it's not random.
Thanks,
Nelson
Re: Need for Speed
Posted: Sun Sep 11, 2022 11:04 pm
by richmond62
I posted the Wikipedia link a while ago.
Re: Need for Speed
Posted: Sun Sep 11, 2022 11:20 pm
by zink137
richmond62,
How did you do it (making a link)?
Nelson
Re: Need for Speed
Posted: Mon Sep 12, 2022 4:29 am
by rkriesel
FourthWorld wrote: Sun Sep 11, 2022 1:06 am
Good call on using both split and combine...
agreed, in general, but here split can be adequate without combine:
Code: Select all
function dedupeViaSplit pLines -- not sustaining the sequence
split pLines by cr as set
return the keys of pLines
end dedupeViaSplit
-- Dick
Re: Need for Speed
Posted: Mon Sep 12, 2022 6:49 am
by scott_morrow
I've made some changes to Richmond's stack so that it can compare the three methods for removing duplicates suggested here:
1) Array Keys -- fastest
2) Split / Combine -- next fastest
3) repeat for each -- just a bit slower than split/combine
Re: Need for Speed
Posted: Mon Sep 12, 2022 7:26 am
by scott_morrow
I didn't see Dick's function, which left out the "combine" step, until after I posted Richmond's updated test stack. However, I went back and compared these two functions and the one with "combine" ran significantly faster. (sorting had no effect) I assume it is the "as set" form but am not familiar enough with split to understand why that might be.
Code: Select all
function dedupeViaSplit pLines
split pLines by cr as set
-- return the keys of pLines
put (the keys of pLines) into tDeDupedLines
sort tDeDupedLines numeric
return tDeDupedLines
end dedupeViaSplit
function DOUBLETROUBLE WXYZ
split WXYZ by return and space
combine WXYZ by return and null
sort WXYZ numeric
return WXYZ
end DOUBLETROUBLE