
Faster way to remove duplicate lines?

Posted: Mon Jul 29, 2013 4:23 pm
by edgore
I have some very long (25K lines) lists that I would like to filter for duplicates, and I am wondering if there is a faster way to do this than what I am currently doing.

Right now I have tried

Code: Select all

on mouseUp
   put field 1 into Temp --field 1 contains about 25K lines
   put empty into noDupes
   put the milliseconds into timer
   repeat for each line theLine in Temp
      -- note: "contains" does substring matching, so a line that happens
      -- to be part of an already-kept line would be skipped as well
      if not (noDupes contains theLine) then
         put theLine & return after noDupes
      end if
   end repeat
   delete the last char of noDupes
   put (the milliseconds - timer) & return & noDupes
end mouseUp
which takes about 108399 milliseconds, and I tried

Code: Select all

on mouseUp
   put field 1 into Temp --field 1 contains about 25K lines
   put empty into noDupes
   put the milliseconds into timer
   repeat for each line theLine in Temp
      -- "is among the lines of" matches whole lines, unlike "contains"
      if theLine is not among the lines of noDupes then
         put theLine & return after noDupes
      end if
   end repeat
   delete the last char of noDupes
   put (the milliseconds - timer) & return & noDupes
end mouseUp
which takes about 107540 milliseconds, a difference that could come down to my having switched over to email while it was running.

Is there any sneaky way to accomplish this faster? I know it's grinding because each pass through the loop has to compare against an ever-increasing list of unique lines, but I can't think of a way to avoid that.

Re: Faster way to remove duplicates?

Posted: Mon Jul 29, 2013 4:53 pm
by edgore
Oh, and I also tried this:

Code: Select all

on mouseUp
   put deleteDuplicateLines(field 1) into noDupes
   put the number of lines in field 1 &&  the number of lines in noDupes
end mouseUp

function deleteDuplicateLines theList 
   split theList by return
   combine theList by return
   return theList 
end deleteDuplicateLines
which is almost instantaneous, but doesn't actually remove any lines (the line count *should* drop from 25778 to 23440 in my stack); instead this script reports 25778 both before and after.
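
As far as I can tell, the problem is that split with a single delimiter just numbers the lines: the array gets keys 1, 2, 3... with one line per element, so duplicate lines sit under different keys and combine hands back the original list. A sketch of what I think the array looks like:

Code: Select all

put "apple" & return & "apple" & return & "pear" into tList
split tList by return
-- tList[1] = "apple", tList[2] = "apple", tList[3] = "pear"
-- the duplicates survive because the KEYS (1, 2, 3) are unique, not the values
combine tList by return
-- tList holds all three lines again, duplicates intact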

Re: Faster way to remove duplicates?

Posted: Mon Jul 29, 2013 4:58 pm
by edgore
Aha! I figured it out!

the script below results in ridiculously fast duplicate removal:

Code: Select all

on mouseUp
   put deleteDuplicateLines(field 1) into noDupes
   put the number of lines in field 1 &&  the number of lines in noDupes
end mouseUp

function deleteDuplicateLines theList -- removes duplicate lines from a list
   split theList by return and space -- the first word of each line becomes an array key; duplicate keys collapse
   combine theList by return and space -- reassemble; leaves a trailing space where a line's value is empty
   replace space & return with return in theList -- strip the trailing space combine adds to single-word lines
   if the last char of theList is space then delete the last char of theList -- same fix for the final line
   return theList
end deleteDuplicateLines
As a side effect, it also sorts the lines of the container.

It also shows I need to spend more time learning how split and combine work.
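
From what I can tell from the docs, split with TWO delimiters builds the keys from the data itself: each line is split at the first space, the part before the space becomes the key, and the rest becomes the value. Since array keys are unique, duplicate lines collapse into a single element. A tiny sketch of my understanding:

Code: Select all

put "red 1" & return & "red 1" & return & "blue 2" into tList
split tList by return and space
-- tList["red"] = "1", tList["blue"] = "2"
-- the duplicate "red 1" vanished: same key, so the element was simply overwritten
combine tList by return and space
-- tList is now "red 1" and "blue 2" on separate lines (order not guaranteed)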

Re: Faster way to remove duplicates?

Posted: Mon Jul 29, 2013 7:46 pm
by Simon
Wow,
Interesting use of split/combine :)

Simon

Re: Faster way to remove duplicates?

Posted: Mon Jul 29, 2013 9:33 pm
by dunbarx
Edgore.

I am with Simon. Simple when you see it. Devilishly clever of you.

Craig

Re: Faster way to remove duplicates?

Posted: Mon Jul 29, 2013 10:04 pm
by edgore
I figured there *had* to be some way of leveraging unique keys in arrays to do this; it just took a while to work out. Hopefully there is not a downside that I am not thinking about. My initial need, and all of my testing, involved a single word per line, but I just tested it with longer lines containing spaces and it seems to still work just fine.

Edit - don't trust the sort, though; it does strange things with alpha characters. Do your own sort if you need one.

Re: Faster way to remove duplicates?

Posted: Mon Jul 29, 2013 11:53 pm
by edgore
Found a bug - it adds a space onto the end of each line. I have updated the routine above to correct this.

Re: Faster way to remove duplicate lines?

Posted: Wed Jul 31, 2013 1:09 pm
by MaxV
LiveCode embeds SQLite, so you can use it this way:

Code: Select all

put the text of field 1 into Temp
put "./temp.sqlite" into tDatabasePath
put revOpenDatabase("sqlite", tDatabasePath, , , , ) into connID
put "CREATE TABLE temp (list TEXT) ;" into tSQL # there are a lot of temp variables here... ;-)
revExecuteSQL connID, tSQL
put "INSERT INTO temp VALUES " into tSQL
repeat for each line tempLine in Temp
   put " ('" & tempLine & "') ," after tSQL
end repeat
delete last char of tSQL # this removes the trailing comma
put " ; " after tSQL
revExecuteSQL connID, tSQL
put "SELECT DISTINCT list FROM temp ;" into tSQL
put revDataFromQuery(tab, return, connID, tSQL) into tRecords
tRecords then contains all the lines without duplicates! The SQLite engine is very quick.
Remember to delete temp.sqlite each time before running the code again.
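
One hazard worth noting: building the INSERT by concatenation breaks if a line contains an apostrophe. revExecuteSQL accepts :1-style placeholders bound to variable names, so a safer (though one-statement-per-line) variant might look like this sketch:

Code: Select all

repeat for each line tempLine in Temp
   # :1 is bound to the contents of tempLine, so quotes in the data are handled safely
   revExecuteSQL connID, "INSERT INTO temp VALUES (:1)", "tempLine"
end repeat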

Re: Faster way to remove duplicate lines?

Posted: Wed Jul 31, 2013 5:03 pm
by edgore
That requires having an SQLite database running outside of LiveCode, though, right? I am trying to avoid external dependencies, and a database adds a set of external requirements for the code to function. Also, in a multi-user environment (like on a server) you would end up creating and deleting a lot of randomly, uniquely named tables every time the delete-duplicates code ran in any script, which seems like asking for trouble.

The array solution seems to be working in all my testing so far, and as far as I know it will run anywhere with nothing but the LiveCode engine installed.

Correct me if my assumptions are wrong - LiveCode doesn't have a built-in SQLite database in the engine that I didn't know about, does it? (That would be awesome, if it did...)

Edit - after doing some reading, I see that yes, SQLite is basically implemented in the engine, and I can now use it for all sorts of useful things. I still like the array solution for removing duplicates, for the multi-user reasons mentioned above, but I have a bunch of situations where I want to cache information from other databases locally for use by multiple scripts and multiple users, and it looks like SQLite will be perfect for that! I would just have to make sure that I delete the cache every 24 hours or so.

Re: Faster way to remove duplicate lines?

Posted: Wed Jul 31, 2013 5:14 pm
by dunbarx
I think Max was experimenting, like all of us.

I think your array gadget is as good as it gets.

To look at it another way:

Code: Select all

on mouseUp
   put temp into myArray[test]
   put temp into myArray[test]
   put temp into myArray[test]
end mouseUp

gives only one value for the single key "test". Love that...

Craig

Re: Faster way to remove duplicate lines?

Posted: Wed Jul 31, 2013 8:05 pm
by edgore
Oh, agreed - all experimentation is good. I was just saying why the DB solution was not a good fit for what I was doing.

Also, there should be a section in the docs about using SQLite - the only reference I could find is a paragraph saying there is a driver for it. Since I didn't know anything about what SQLite was, I didn't realize that this meant there was a full implementation of it; I assumed that you needed to have a DB instance running someplace for your application to talk to (that's how it works with everything else I deal with - Oracle, DB2, MySQL). Finding out that the DB functionality is built into the driver and that I can just create and use databases as needed in LiveCode opens up a lot of possibilities. There should be a huge banner that says "HEY! THIS MEANS THERE IS A FUNCTIONAL SQL DATABASE INCLUDED IN LIVECODE, DUMMY!". That, I might have noticed.

(Of course, now I want to be able to create virtual database objects in memory, without having to write external files...)
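
(From what I have read about SQLite itself, the special filename ":memory:" does exactly that - it creates a database that lives only in RAM and disappears when the connection closes. If the LiveCode driver passes the path straight through to SQLite, something like this sketch should work, though I have not tried it:)

Code: Select all

-- untested: SQLite's special ":memory:" path, assuming the driver passes it through
put revOpenDatabase("sqlite", ":memory:", , , , ) into connID
-- create tables and query as usual...
revCloseDatabase connID -- the whole database vanishes with the connection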

Re: Faster way to remove duplicate lines?

Posted: Thu Aug 01, 2013 1:15 pm
by MaxV
Well, you don't need a real external database. When you write

Code: Select all

put "./temp.sqlite" into tDatabasePath
put revOpenDatabase("sqlite", tDatabasePath, , , , ) into connID 
If LiveCode doesn't find it, LiveCode creates the database.

Just add:

Code: Select all

put "DROP TABLE temp  ; " into tSQL  #there are a lot of temp variables here... ;-)
revExecuteSQL connID,tSQL 
at the end of your function, and you don't need to delete any file.

Example:

Code: Select all

on nodup
   put the text of field 1 into Temp
   put "./temp.sqlite" into tDatabasePath
   put revOpenDatabase("sqlite", tDatabasePath, , , , ) into connID
   put "CREATE TABLE temp (list TEXT) ;" into tSQL # there are a lot of temp variables here... ;-)
   revExecuteSQL connID, tSQL
   put "INSERT INTO temp VALUES " into tSQL
   repeat for each line tempLine in Temp
      put " ('" & tempLine & "') ," after tSQL
   end repeat
   delete last char of tSQL # this removes the trailing comma
   put " ; " after tSQL
   revExecuteSQL connID, tSQL
   put "SELECT DISTINCT list FROM temp ;" into tSQL
   put revDataFromQuery(tab, return, connID, tSQL) into tRecords
   # tRecords now holds the de-duplicated lines
   put "DROP TABLE temp ;" into tSQL
   revExecuteSQL connID, tSQL
   revCloseDatabase connID # close the connection so the file is released
end nodup
This way you don't need to care whether the database file already exists or not. 8-)

Re: Faster way to remove duplicate lines?

Posted: Thu Aug 01, 2013 1:42 pm
by Klaus
Hi Ed,

sorry for chiming in a bit late.

Please try the ARRAY approach to this problem. I tested it with 10,000 (short) lines of example data (first_name name), which contained 22 duplicates, and it took 19 millisecs :-)

Code: Select all

on mouseUp
   put field 1 into Temp
   ## field 1 contains 10K lines in my example
   put empty into tArray
   put the milliseconds into timer

   ## We "abuse" the KEYs of that array for our purpose!
   repeat for each line theLine in Temp
      add 1 to tArray[theLine]
   end repeat
   put the keys of tArray into fld 2
   put (the milliseconds - timer)
end mouseUp
At least worth a try!

EDIT: Just tested with 30,000 lines and it took 22 millisecs :-)
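
One caveat: the keys of an array come back in no particular order, so if you need the result ordered, sort it yourself, e.g.:

Code: Select all

put the keys of tArray into tResult
sort lines of tResult ## key order is arbitrary, so sort explicitly if it matters
put tResult into fld 2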


Best

Klaus

Re: Faster way to remove duplicate lines?

Posted: Thu Aug 01, 2013 2:15 pm
by sritcp
edgore:

Am I correct in understanding that your array method will only work if the first word of each line (the "key") is unique for each unique line?
That is, if there are two non-duplicate lines that happen to share the same first word, your routine will delete all but the first such occurrence?

Sri.

Re: Faster way to remove duplicate lines?

Posted: Thu Aug 01, 2013 4:44 pm
by edgore
@sritcp
Just tested it with a field that contains

item 1
item 2
item 3
item 4
item 4
item 4
item 5

And you are right - it comes back with a single line containing "item 5", so it only works when the first word is unique. I knew there had to be a catch; that is what I get for playing with arrays when I don't really understand them.

It still works for what I was doing, but it's not a general solution. I need to look into it a little more and see what else I can come up with - maybe using an unusual character for the second delimiter? Something that is not going to show up in text generally? I also need to look at what Klaus just proposed and try to understand what it's doing.
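
For what it's worth, Klaus's approach looks like the general solution, since it keys on the WHOLE line rather than just the first word. Wrapped as a function, something like this sketch should do it:

Code: Select all

function removeDuplicateLines theList
   -- each whole line becomes an array key, so duplicate lines collapse into one
   repeat for each line theLine in theList
      put 1 into tSeen[theLine]
   end repeat
   put the keys of tSeen into tResult
   sort lines of tResult -- key order is arbitrary; sort for a stable result
   return tResult
end removeDuplicateLines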