Faster way to remove duplicate lines?
Re: Faster way to remove duplicate lines?
@MaxV
In a multiuser environment I would still have to randomly generate a unique table name, create it, and drop it for each user in order to avoid conflicts though, right? It is cool that I can just create a database file using LiveCode and then populate it with whatever tables I need on demand - that's going to be amazingly useful!
Re: Faster way to remove duplicate lines?
We cannot just let this tantalizing method go.
I see that the issue is that the keys of the array are split by the first space in the examples, but those keys are part and parcel of the text in each line. So on a list of 16000 lines, where a snippet of that list is:
aa bb cc
aa bb cc
xx yy zz
ss dd ff
aa bb cc
aa bb cc
xx yy zz
ss dd ff
I did this:
Code: Select all
on mouseup
get fld "output"
split it by return and return
combine it by return and space
end mouseup
This seems to work, since the keys no longer are part of the composition of the text in each line. I am trying to figure out why the first line is duplicated, but that can be solved without even understanding it.
Craig
Re: Faster way to remove duplicate lines?
Something very similar was posted either here in the forum or on the use-list a few years ago. I have this example snippet stored:
on mouseup
put empty into field "Field2"
put field "Field" into tList
get list.deleteDuplicates(tList)
put it after field "Field2"
answer the number of lines in field "Field" & comma && the number of lines in field "Field2"
end mouseup
function list.deleteDuplicates pList
split pList by return and comma
combine pList by return and comma
sort pList
return pList
end list.deleteDuplicates
It works in my scenario, and the function is a tiny bit simpler than what edgore suggests. However, I note that if a line does not contain a comma, one will be added at the end of it; lines that do contain a comma somewhere, anywhere, don't suffer this affliction. It does seem to handle lines which start with the same word but are not otherwise identical, and it seems as fast as Klaus's method if I add a sort on the keys before putting them in the field in Klaus's method.
The method suggested by Klaus is case insensitive, which could be important for some uses where you wouldn't want A = a, and my function does not detect differences that occur after a comma in the line, which could be a fatal problem. A bullet-proof solution seems elusive. I suppose you could split by some character you're certain isn't in any of the lines and then delete it from the list:
function list.deleteDuplicates pList
split pList by return and "^"
combine pList by return and "^"
sort pList
replace "^" with "" in pList
return pList
end list.deleteDuplicates
This works with my list. Can you make the keys of an array case sensitive, so Klaus's method would detect case differences as unique lines?
Re: Faster way to remove duplicate lines?
Here's an experiment using several of what I think would be very rare characters as the possible split delimiter. It seems to work perfectly, and I would love to see the data that breaks it. There is a stack attached as well that demonstrates its strange powers.
Code: Select all
on mouseUp
put deleteDuplicateLines(field "theText") into noDupes
put noDupes
end mouseUp
function deleteDuplicateLines theList
switch
case not (theList contains numToChar(7)) --BELL
split theList by return and numToChar(7)
combine theList by return and numToChar(7)
replace numToChar(7) with empty in theList
return theList
break
case not (theList contains numToChar(5)) --ENQUIRY
split theList by return and numToChar(5)
combine theList by return and numToChar(5)
replace numToChar(5) with empty in theList
return theList
break
case not (theList contains numToChar(22)) --SYNCHRONOUS IDLE
split theList by return and numToChar(22)
combine theList by return and numToChar(22)
replace numToChar(22) with empty in theList
return theList
break
case not (theList contains numToChar(27)) --ESCAPE
split theList by return and numToChar(27)
combine theList by return and numToChar(27)
replace numToChar(27) with empty in theList
return theList
break
default
return "Who are you and where did you get this data?!"
end switch
end deleteDuplicateLines
Attachments: deleteDuplicateLines.zip - Too much time (1.53 KiB)
Re: Faster way to remove duplicate lines?
edgore wrote:
@MaxV
In a multiuser environment I would still have to randomly generate a unique table name, create it and drop it for each user in order to avoid conflicts though, right?
No, just add the user name to the database name, so everybody uses their own db. SQLite isn't multi-user: each DB can be accessed by just one user at a time. SQLite DBs are like files, with the same permissions, so if you lock a file for writing, others can't write to it. Permissions also depend on the operating system you use, because they are the same as the OS file permissions.
edgore wrote:
It is cool that I can just create a database file using LiveCode and then populate it with whatever tables I need on demand - that's going to be amazingly useful!
Absolutely YES!
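A minimal LiveCode sketch of the per-user-database idea (the folder path, handler name, and table definition are placeholders of mine, not from this thread):

Code: Select all
-- hypothetical sketch: one SQLite file per user, so writers never collide
function openUserDatabase pUserName
   local tPath, tConnID
   put specialFolderPath("documents") & "/dedupe_" & pUserName & ".sqlite" into tPath
   -- revOpenDatabase creates the SQLite file if it does not exist yet
   put revOpenDatabase("sqlite", tPath, , , ) into tConnID
   revExecuteSQL tConnID, "CREATE TABLE IF NOT EXISTS lines (txt TEXT UNIQUE)"
   return tConnID
end openUserDatabase

With a UNIQUE column like this, duplicate rows are rejected at insert time, which is another way to get deduplication for free.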
Livecode Wiki: http://livecode.wikia.com
My blog: https://livecode-blogger.blogspot.com
To post code use this: http://tinyurl.com/ogp6d5w
Re: Faster way to remove duplicate lines?
I looked around and here is the older thread that was mentioned a couple of posts back - it is very similar, but did not get into exploring some of the pitfalls/caveats and other methods that this thread has. Still, nice to have everything referenced in one place.
http://forums.runrev.com/viewtopic.php? ... ates#p5234
Re: Faster way to remove duplicate lines?
MaxV wrote:
No, just add the user name to the database name, so everybody uses their own db.
That's pretty much what I was saying - since I don't have user IDs, I would randomly generate a 6-digit number to use for the name so there would not be conflicts.
I will have to read up on SQLite and experiment with it - you say that it's not multiuser - is that because of write locking only? Can multiple users read from the same database at the same time? This might be worth moving to its own thread (though I imagine there are probably SQLite threads out there already that I have not read).
Re: Faster way to remove duplicate lines?
Klaus's solution of making the data lines into array keys is utterly elegant and is really the efficient general solution to removing duplicates.
Sri.
Re: Faster way to remove duplicate lines?
sritcp wrote:
Klaus's solution of making the data lines into array keys is utterly elegant and is really the efficient general solution to removing duplicates.
Sri.
As mentioned above, it has the defect of being case insensitive, and it is no faster than split and combine.
Re: Faster way to remove duplicate lines?
@Sri, you are right, it just took me a while to figure out what he is doing. Basically he's adding 1 to an element of an array (created earlier in the script) whose key is whatever the current line of the list being processed is. This creates a new key named for the line's text and adds 1 to the value stored under that key.
A side benefit of this, I guess, is that you could also use it to very quickly count the number of occurrences of a particular thing in the container you are processing, since each time the script revisits a key it adds 1 to the value under that key. So if there were eight identical lines in a particular container, you would end up with a count of eight associated with that key in the array.
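A minimal sketch of that array-key approach as I understand it (the handler and variable names are mine, not Klaus's original code):

Code: Select all
-- each distinct line becomes an array key; the value counts its occurrences
function deDupeViaArray pList
   local tCounts, tResult
   repeat for each line tLine in pList
      add 1 to tCounts[tLine] -- first sight of a line creates its key
   end repeat
   repeat for each key tKey in tCounts
      put tKey & return after tResult
   end repeat
   delete the last char of tResult
   return tResult
end deDupeViaArray

Note that the keys come back in no particular order, which is why sorting the keys was mentioned earlier in the thread.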
@wsamples I am not sure that I can think of anything that would add case sensitivity to any solution - anything that depends on cheating with the array functions will have that issue. The big advantage that Klaus' solution has is that it doesn't care about the content of the line you are processing - you don't have to worry about using a delimiter in the split that occurs in the text you are splitting. (I think - I haven't done any testing yet, but theoretically it shouldn't matter)
Re: Faster way to remove duplicate lines?
edgore wrote:
@wsamples I am not sure that I can think of anything that would add case sensitivity to any solution - anything that depends on cheating with the array functions will have that issue.
I recognize the advantage Klaus's method possesses in avoiding the problem with the delimiter. Split/combine, however, does detect case differences and will count "apple" and "Apple" as distinct, unique lines.
Re: Faster way to remove duplicate lines?
Oh, I had not realized that it did that - neat! I just tried it in the sample stack I uploaded and sure enough changing Gregor to gregor in one line results in it not being deleted as a duplicate.
For what I am doing, though, that's actually a bug.
So, it looks like we have two good solutions:
The array solution from Klaus if you want case-insensitive
Split/combine with a bizarre second delimiter if you want case-sensitive
I am curious to understand why one method of creating the keys of an array would be case sensitive, while the other is not - you would think that sensitivity would be in the engine code managing the array, not in the code for creating the keys (arrays are sufficiently advanced to be indistinguishable from magic for me to start with)
Re: Faster way to remove duplicate lines?
Setting the caseSensitive to true at the start of the handler that calls Klaus's method makes it case sensitive. So this is something that can be toggled according to one's needs. I wonder what other surprises might be lurking under the stones.
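For instance, a minimal sketch (the field names follow the earlier examples in this thread; the handler body is mine):

Code: Select all
on mouseUp
   set the caseSensitive to true -- "Apple" and "apple" now become separate keys
   local tCounts
   repeat for each line tLine in field "theText"
      add 1 to tCounts[tLine]
   end repeat
   put the keys of tCounts into field "output"
end mouseUp

The caseSensitive property applies only to the current handler, so there is no need to set it back to false by hand.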
Re: Faster way to remove duplicate lines?
Interesting - I wonder if that means that the sensitivity in split/combine is a bug...