Faster way to remove duplicate lines?

edgore
VIP Livecode Opensource Backer
Posts: 197
Joined: Wed Jun 14, 2006 8:40 pm

Re: Faster way to remove duplicate lines?

Post by edgore » Thu Aug 01, 2013 4:50 pm

@MaxV

In a multiuser environment I would still have to randomly generate a unique table name, create it and drop it for each user in order to avoid conflicts though, right? It is cool that I can just create a database file from within LiveCode and then populate it with whatever tables I need on demand - that's going to be amazingly useful!

dunbarx
VIP Livecode Opensource Backer
Posts: 10305
Joined: Wed May 06, 2009 2:28 pm

Re: Faster way to remove duplicate lines?

Post by dunbarx » Thu Aug 01, 2013 6:52 pm

We cannot just let this tantalizing method go.

I see that the issue is that in the earlier examples the array keys are formed by splitting each line at the first space, even though those keys are part and parcel of the text in each line. So on a list of 16,000 lines, where a snippet of that list is:

aa bb cc
aa bb cc
xx yy zz
ss dd ff
aa bb cc
aa bb cc
xx yy zz
ss dd ff

I did this:

Code:

on mouseup
   get fld "output"
   split it by return and return
   combine it by return and space
end mouseup
This seems to work, since the keys are no longer just a fragment of the text in each line. I am trying to figure out why the first line is duplicated, but that can be solved without even understanding it.

Craig

wsamples
VIP Livecode Opensource Backer
Posts: 264
Joined: Mon May 18, 2009 4:12 am

Re: Faster way to remove duplicate lines?

Post by wsamples » Fri Aug 02, 2013 1:16 am

Something very similar was posted either here in the forum or on the use-list a few years ago. I have this example snippet stored:

on mouseup
   put empty into field "Field2"
   put field "Field" into tList
   get list.deleteDuplicates(tList)
   put it after field "Field2"
   answer the number of lines in field "Field" & comma && the number of lines in field "Field2"
end mouseup

function list.deleteDuplicates pList
   split pList by return and comma
   combine pList by return and comma
   sort pList
   return pList
end list.deleteDuplicates

It works in my scenario, and the function is a tiny bit simpler than what edgore suggests. However, I note that if a line does not contain a comma, one will be added at the end of it; lines that do contain a comma somewhere, anywhere, don't suffer this affliction. It does handle lines which start with the same word but are not otherwise identical. It seems as fast as Klaus' method if I add a sort on the keys before putting them into the field in Klaus' method.
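
A tiny illustration of that trailing-comma behaviour (the two test lines are made up):

on mouseUp
   local tList
   put "alpha" & return & "beta,gamma" into tList
   split tList by return and comma
   combine tList by return and comma
   -- the message box now shows "alpha," and "beta,gamma" (key order may vary):
   -- the comma-less line has picked up a trailing comma
   put tList
end mouseUp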

The method suggested by Klaus is case insensitive, which could be important for uses where you wouldn't want A = a, and my function does not detect differences that occur after a comma in the line, which could be a fatal problem. A bullet-proof solution seems elusive. I suppose you could split by some character you're certain isn't in any of the lines and delete it from the list:

function list.deleteDuplicates pList
   split pList by return and "^"
   combine pList by return and "^"
   sort pList
   replace "^" with "" in pList
   return pList
end list.deleteDuplicates

This works with my list. Can you make the keys of an array case sensitive, so Klaus' method would detect case differences as unique lines?

edgore
VIP Livecode Opensource Backer
Posts: 197
Joined: Wed Jun 14, 2006 8:40 pm

Re: Faster way to remove duplicate lines?

Post by edgore » Fri Aug 02, 2013 5:23 am

Here's an experiment using several of what I think would be very rare characters as the possible split delimiter. It seems to work perfectly, and I would love to see the data that breaks it. There is a stack attached as well that demonstrates its strange powers.

Code:

on mouseUp
   put deleteDuplicateLines(field "theText") into noDupes
   put noDupes
end mouseUp

function deleteDuplicateLines theList 
   switch
      case theList contains numToChar(7) is false --BELL
         split theList by return and numToChar(7)
         combine theList by return and numToChar(7)
         replace numToChar(7) with empty in theList
         return theList 
         break
      case theList contains numToChar(5) is false --ENQUIRY
         split theList by return and numToChar(5)
         combine theList by return and numToChar(5)
         replace numToChar(5) with empty in theList
         return theList 
         break
      case theList contains numToChar(22) is false --SYNCHRONOUS IDLE
         split theList by return and numToChar(22)
         combine theList by return and numToChar(22)
         replace numToChar(22) with empty in theList
         return theList 
         break
      case theList contains numToChar(27) is false --ESCAPE
         split theList by return and numToChar(27)
         combine theList by return and numToChar(27)
         replace numToChar(27) with empty in theList
         return theList 
         break
      default
         return "Who are you and where did you get this data?!"
   end switch
end deleteDuplicateLines
Attachments
deleteDuplicateLines.zip
Too much time
(1.53 KiB) Downloaded 326 times

MaxV
Posts: 1580
Joined: Tue May 28, 2013 2:20 pm

Re: Faster way to remove duplicate lines?

Post by MaxV » Fri Aug 02, 2013 11:41 am

edgore wrote:@MaxV

In a multiuser environment I would still have to randomly generate a unique table name, create it and drop it for each user in order to avoid conflicts though, right?

No, just add the user name to the database name so everybody uses their own DB. SQLite isn't multi-user: each DB can be accessed by only one user at a time. SQLite DBs are just files, with the same permissions, so if you lock a file for writing, others can't write to it. Permissions also depend on the operating system you use, because they are the same as the OS file permissions.

edgore wrote:It is cool that I can just create a database file from within LiveCode and then populate it with whatever tables I need on demand - that's going to be amazingly useful!

Absolutely YES
Livecode Wiki: http://livecode.wikia.com
My blog: https://livecode-blogger.blogspot.com
To post code use this: http://tinyurl.com/ogp6d5w

edgore
VIP Livecode Opensource Backer
Posts: 197
Joined: Wed Jun 14, 2006 8:40 pm

Re: Faster way to remove duplicate lines?

Post by edgore » Fri Aug 02, 2013 3:22 pm

I looked around and here is the older thread that was mentioned a couple of posts back - it is very similar, but did not explore some of the pitfalls, caveats, and other methods that this thread has. Still, it is nice to have everything referenced in one place.

http://forums.runrev.com/viewtopic.php? ... ates#p5234

edgore
VIP Livecode Opensource Backer
Posts: 197
Joined: Wed Jun 14, 2006 8:40 pm

Re: Faster way to remove duplicate lines?

Post by edgore » Fri Aug 02, 2013 3:26 pm

MaxV wrote:No, just add the user name to the database name so everybody uses their own DB.
That's pretty much what I was saying - since I don't have user IDs I would randomly generate a 6-digit number to use for the name so there would not be conflicts.

I will have to read up on SQLite and experiment with it some - you say that it's not multiuser - is that because of write locking only? Can multiple users read from the same database at the same time? This might be worth moving to its own thread (though I imagine there are probably SQLite threads already out there that I have not read).
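
For what it's worth, a rough sketch of that idea (the folder, file name and random suffix here are just made up; as far as I know revOpenDatabase will create the SQLite file if it doesn't already exist):

on mouseUp
   local tDBPath, tDBID
   -- build a hopefully-unique per-user file name with a random numeric suffix
   put specialFolderPath("documents") & "/scratch_" & random(999999) & ".sqlite" into tDBPath
   put revOpenDatabase("sqlite", tDBPath, , , , ) into tDBID
   if tDBID is not an integer then
      answer "Could not open database:" && tDBID
   else
      -- ... create tables and do the de-duplication work here ...
      revCloseDatabase tDBID
      delete file tDBPath
   end if
end mouseUp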

sritcp
Posts: 431
Joined: Tue Jun 05, 2012 5:38 pm

Re: Faster way to remove duplicate lines?

Post by sritcp » Fri Aug 02, 2013 3:49 pm

Klaus's solution of making the data lines into array keys is utterly elegant and is really the efficient general solution to removing duplicates.

Sri.

wsamples
VIP Livecode Opensource Backer
Posts: 264
Joined: Mon May 18, 2009 4:12 am

Re: Faster way to remove duplicate lines?

Post by wsamples » Fri Aug 02, 2013 4:18 pm

sritcp wrote:Klaus's solution of making the data lines into array keys is utterly elegant and is really the efficient general solution to removing duplicates.

Sri.
As mentioned above, it has the defect of being case insensitive and is no faster than split and combine.

edgore
VIP Livecode Opensource Backer
Posts: 197
Joined: Wed Jun 14, 2006 8:40 pm

Re: Faster way to remove duplicate lines?

Post by edgore » Fri Aug 02, 2013 4:25 pm

@Sri, you are right, it just took me a while to figure out what he is doing. Basically he's adding 1 to an element of an array (created earlier in the script) whose key is the text of the line currently being processed. The first time a given line is seen this creates the key; after that it just adds 1 to the value stored under it.

A side benefit of this, I guess, is that you could also use it to very quickly count the number of occurrences of a particular line in the container you are processing, since every time it encounters an existing key it adds 1 to the value stored under that key (the "value" being the name for the non-key content of an array that I couldn't remember). So if eight lines in a container were identical you would end up with a count of eight associated with that key in the array.
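
For illustration, here is a minimal sketch of that pattern (field names are borrowed from the earlier snippets, not from Klaus' actual code):

on mouseUp
   local tCounts, tLine, tResult
   repeat for each line tLine in field "theText"
      -- the line itself becomes the key; the value counts how often the line occurs
      add 1 to tCounts[tLine]
   end repeat
   -- the keys of the array are the de-duplicated lines
   put the keys of tCounts into tResult
   sort lines of tResult
   put tResult into field "output"
end mouseUp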

@wsamples I am not sure that I can think of anything that would add case sensitivity to any solution - anything that depends on cheating with the array functions will have that issue. The big advantage that Klaus' solution has is that it doesn't care about the content of the line you are processing - you don't have to worry about using a delimiter in the split that occurs in the text you are splitting. (I think - I haven't done any testing yet, but theoretically it shouldn't matter)

wsamples
VIP Livecode Opensource Backer
Posts: 264
Joined: Mon May 18, 2009 4:12 am

Re: Faster way to remove duplicate lines?

Post by wsamples » Fri Aug 02, 2013 5:32 pm

edgore wrote:@wsamples I am not sure that I can think of anything that would add case sensitivity to any solution - anything that depends on cheating with the array functions will have that issue. The big advantage that Klaus' solution has is that it doesn't care about the content of the line you are processing - you don't have to worry about using a delimiter in the split that occurs in the text you are splitting. (I think - I haven't done any testing yet, but theoretically it shouldn't matter)
I recognize the advantage Klaus' method possesses in avoiding the problem with the delimiter.

Split/combine does detect case differences and will count "apple" and "Apple" as distinct, unique lines.

edgore
VIP Livecode Opensource Backer
Posts: 197
Joined: Wed Jun 14, 2006 8:40 pm

Re: Faster way to remove duplicate lines?

Post by edgore » Fri Aug 02, 2013 6:01 pm

Oh, I had not realized that it did that - neat! I just tried it in the sample stack I uploaded and sure enough changing Gregor to gregor in one line results in it not being deleted as a duplicate.

For what I am doing though, that's actually a bug :).

So, it looks like we have two good solutions:

The array solution from Klaus if you want case-insensitive
Split/combine with a bizarre second delimiter if you want case-sensitive

I am curious to understand why one method of creating the keys of an array would be case sensitive, while the other is not - you would think that sensitivity would be in the engine code managing the array, not in the code for creating the keys (arrays are sufficiently advanced to be indistinguishable from magic for me to start with)

wsamples
VIP Livecode Opensource Backer
Posts: 264
Joined: Mon May 18, 2009 4:12 am

Re: Faster way to remove duplicate lines?

Post by wsamples » Fri Aug 02, 2013 6:35 pm

Setting the caseSensitive to true at the start of the handler that calls Klaus' method makes it case sensitive. So, this is something that can be toggled according to one's needs. I wonder what other surprises might be lurking under the stones :D
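
Here is a quick sketch of what I mean (the field names are just placeholders):

on mouseUp
   local tUnique, tLine
   set the caseSensitive to true -- "Apple" and "apple" now produce distinct keys
   repeat for each line tLine in field "theText"
      put 1 into tUnique[tLine]
   end repeat
   put the keys of tUnique into field "output"
   set the caseSensitive to false -- restore the default when done
end mouseUp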

edgore
VIP Livecode Opensource Backer
Posts: 197
Joined: Wed Jun 14, 2006 8:40 pm

Re: Faster way to remove duplicate lines?

Post by edgore » Fri Aug 02, 2013 6:40 pm

Interesting - I wonder if that means that the sensitivity in split/combine is a bug...
