Faster way to remove duplicate lines?

Got a LiveCode personal license? Are you a beginner, hobbyist or educator that's new to LiveCode? This forum is the place to go for help getting started. Welcome!

Moderators: FourthWorld, heatherlaine, Klaus, kevinmiller

edgore
VIP Livecode Opensource Backer
Posts: 197
Joined: Wed Jun 14, 2006 8:40 pm

Faster way to remove duplicate lines?

Post by edgore » Mon Jul 29, 2013 4:23 pm

I have some very long (25K lines) lists that I would like to filter for duplicates, and I am wondering if there is a faster way to do this than what I am currently doing.

Right now I have tried

Code:

on mouseUp
   put field 1 into Temp --field 1 contains about 25K lines
   put empty into noDupes
   put the milliseconds into timer
   repeat for each line theLine in Temp
      if noDupes contains theLine is false then
         put theLine & return after noDupes
      end if
   end repeat
   delete the last char of noDupes
   put (the milliseconds - timer) & return & noDupes
end mouseUp
which takes about 108399 milliseconds, and I tried

Code:

on mouseUp
   put field 1 into Temp --field 1 contains about 25K lines
   put empty into noDupes
   put the milliseconds into timer
   repeat for each line theLine in Temp
      if theLine is not among the lines of noDupes then
         put theLine & return after noDupes
      end if
   end repeat
   delete the last char of noDupes
   put (the milliseconds - timer) & return & noDupes
end mouseUp
which takes about 107540 milliseconds, a difference that could come down to my having switched over to email while it was running.

Is there any sneaky way to accomplish this faster? I know that it's grinding because each run through the loop has to compare against an ever-increasing list of unique lines, but I can't think of a way to avoid that.
Last edited by edgore on Tue Jul 30, 2013 10:18 pm, edited 1 time in total.

edgore
VIP Livecode Opensource Backer
Posts: 197
Joined: Wed Jun 14, 2006 8:40 pm

Re: Faster way to remove duplicates?

Post by edgore » Mon Jul 29, 2013 4:53 pm

Oh, and I also tried this:

Code:

on mouseUp
   put deleteDuplicateLines(field 1) into noDupes
   put the number of lines in field 1 &&  the number of lines in noDupes
end mouseUp

function deleteDuplicateLines theList 
   split theList by return
   combine theList by return
   return theList 
end deleteDuplicateLines
which is almost instantaneous, but doesn't seem to actually remove any lines (the line count *should* drop from 25778 to 23440 in my stack); instead, this script reports 25778 before and after.
Last edited by edgore on Mon Jul 29, 2013 10:10 pm, edited 1 time in total.

edgore
VIP Livecode Opensource Backer
Posts: 197
Joined: Wed Jun 14, 2006 8:40 pm

Re: Faster way to remove duplicates?

Post by edgore » Mon Jul 29, 2013 4:58 pm

Aha! I figured it out!

The script below results in ridiculously fast duplicate removal:

Code:

on mouseUp
   put deleteDuplicateLines(field 1) into noDupes
   put the number of lines in field 1 &&  the number of lines in noDupes
end mouseUp

function deleteDuplicateLines theList --this function removes duplicate lines from a list
   split theList by return and space
   combine theList by return and space
   replace space & return with return in theList
   delete the last char of theList
   return theList
end deleteDuplicateLines
As a side effect, it also sorts the lines of the container.

It also shows I need to spend more time learning how split and combine work.
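For reference, here is a minimal sketch of what split and combine are doing in the function above, using made-up two-line data:

Code:

put "alpha 1" & return & "beta 2" into tData
split tData by return and space
-- tData is now an array: tData["alpha"] = "1" and tData["beta"] = "2"
-- (the text before the first space in each line becomes the key)
combine tData by return and space
-- tData is back to the two lines "alpha 1" and "beta 2", in key order,
-- which is why the result comes back sorted
Because array keys are unique, any two lines that split to the same key collapse into one entry - that is where the deduplication happens.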
Last edited by edgore on Mon Jul 29, 2013 11:56 pm, edited 3 times in total.

Simon
VIP Livecode Opensource Backer
Posts: 3901
Joined: Sat Mar 24, 2007 2:54 am

Re: Faster way to remove duplicates?

Post by Simon » Mon Jul 29, 2013 7:46 pm

Wow,
Interesting use of split/combine :)

Simon
I used to be a newbie but then I learned how to spell teh correctly and now I'm a noob!

dunbarx
VIP Livecode Opensource Backer
Posts: 10305
Joined: Wed May 06, 2009 2:28 pm

Re: Faster way to remove duplicates?

Post by dunbarx » Mon Jul 29, 2013 9:33 pm

Edgore.

I am with Simon. Simple when you see it. Devilishly clever of you.

Craig

edgore
VIP Livecode Opensource Backer
Posts: 197
Joined: Wed Jun 14, 2006 8:40 pm

Re: Faster way to remove duplicates?

Post by edgore » Mon Jul 29, 2013 10:04 pm

I figured there *had* to be some way of leveraging unique keys in arrays to do this; it just took a while to work it out. Hopefully there is not a downside that I am not thinking about. My initial need, and all of my testing, were for a single word on the line, but I just tested it with longer lines and spaces in the line and it seems to still work just fine.

Edit - don't trust the sort though - it's strange with alpha characters. Do your own sort if you need that.

edgore
VIP Livecode Opensource Backer
Posts: 197
Joined: Wed Jun 14, 2006 8:40 pm

Re: Faster way to remove duplicates?

Post by edgore » Mon Jul 29, 2013 11:53 pm

Found a bug - it adds a space onto the end of each line. I have updated the routine above to correct this.

MaxV
Posts: 1580
Joined: Tue May 28, 2013 2:20 pm
Contact:

Re: Faster way to remove duplicate lines?

Post by MaxV » Wed Jul 31, 2013 1:09 pm

Livecode embeds SQLite, so you can use it this way:

Code:

put the text of field 1 into Temp
put "./temp.sqlite" into tDatabasePath
put revOpenDatabase("sqlite", tDatabasePath, , , , ) into connID
put "CREATE TABLE temp (list TEXT) ; " into tSQL  # there are a lot of temp variables here... ;-)
revExecuteSQL connID, tSQL
put "INSERT INTO temp VALUES " into tSQL
repeat for each line tempLine in Temp
   put " ('" & tempLine & "') ," after tSQL
end repeat
delete last char of tSQL  # this removes the last comma
put " ; " after tSQL
revExecuteSQL connID, tSQL
put "SELECT DISTINCT list FROM temp ; " into tSQL
put revDataFromQuery(tab, return, connID, tSQL) into tRecords
tRecords contains all lines without duplicates!
SQLite engine is very quick.
Remember to delete temp.sqlite each time before running the code again.
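In script, that cleanup could look like this (a sketch; it assumes the database file sits in the current working directory, as in the snippet above):

Code:

-- remove the scratch database before re-running, so CREATE TABLE won't fail
if there is a file "./temp.sqlite" then
   delete file "./temp.sqlite"
end if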
Livecode Wiki: http://livecode.wikia.com
My blog: https://livecode-blogger.blogspot.com
To post code use this: http://tinyurl.com/ogp6d5w

edgore
VIP Livecode Opensource Backer
Posts: 197
Joined: Wed Jun 14, 2006 8:40 pm

Re: Faster way to remove duplicate lines?

Post by edgore » Wed Jul 31, 2013 5:03 pm

That requires having an SQLite database running outside of Livecode though, right? I am trying to avoid external dependencies. I had thought about using a database to accomplish this, but that opens up a set of external requirements for the code to function that I would like to avoid. Also, in a multi-user environment (like on a server) you would end up having to create and delete a lot of randomly, uniquely named tables each time the delete-duplicates code ran in any script, which seems like asking for trouble.

The array solution seems to be working in all my testing so far, and as far as I know will run anywhere with just the engine installed and nothing but Livecode.

Correct me if my assumptions are wrong - Livecode doesn't have a built-in SQLite database in the engine that I didn't know about, does it? (that would be awesome, if it did...)

Edit - after doing some reading, I see that yes, SQLite is basically implemented in the engine and I can now use it for all sorts of useful things. I still like the array solution for removing duplicates, for the reasons mentioned above about multiple users, but I have a bunch of situations where I want to cache information from other databases locally for use in multiple scripts for multiple users and it looks like SQLite will be perfect for that! I would just have to make sure that I delete the cache every 24 hours or so.

dunbarx
VIP Livecode Opensource Backer
Posts: 10305
Joined: Wed May 06, 2009 2:28 pm

Re: Faster way to remove duplicate lines?

Post by dunbarx » Wed Jul 31, 2013 5:14 pm

I think Max was experimenting, like all of us.

I think your array gadget is as good as it gets.

To look at it another way:

on mouseUp
put temp into myArray[test]
put temp into myArray[test]
put temp into myArray[test]
end mouseUp

gives only one value for the single key "test". Love that...

Craig

edgore
VIP Livecode Opensource Backer
Posts: 197
Joined: Wed Jun 14, 2006 8:40 pm

Re: Faster way to remove duplicate lines?

Post by edgore » Wed Jul 31, 2013 8:05 pm

Oh, agreed - all experimentation is good. I was just saying why the DB solution was not a good fit for what I was doing.

Also, there should be a section in the docs talking about using SQLite - the only reference I could find in the docs is a paragraph saying there is a driver for it. Since I didn't know anything about what SQLite was, I didn't realize that this meant there was a full implementation of it - I assumed that you needed to have a DB instance running someplace for your application to talk to (that's how it works with everything else I have to deal with - Oracle, DB2, MySQL). Finding out that the DB functionality is built into the driver and that I can just create and use databases as needed in Livecode opens up a lot of possibilities. There should be a huge banner that says "HEY! THIS MEANS THERE IS A FUNCTIONAL SQL DATABASE INCLUDED IN LIVECODE, DUMMY!". That, I might have noticed.

(Of course, now I want to be able to create virtual database objects in memory, without having to write external files...)
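SQLite itself supports exactly that via the special path ":memory:". Assuming the rev driver passes the path straight through to SQLite (an assumption worth verifying), it would look like:

Code:

-- untested sketch: an in-memory SQLite database, no file on disk;
-- the data disappears when the connection is closed
put revOpenDatabase("sqlite", ":memory:", , , , ) into connID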

MaxV
Posts: 1580
Joined: Tue May 28, 2013 2:20 pm
Contact:

Re: Faster way to remove duplicate lines?

Post by MaxV » Thu Aug 01, 2013 1:15 pm

Well, you don't need a real external database. When you write

Code:

put "./temp.sqlite" into tDatabasePath
put revOpenDatabase("sqlite", tDatabasePath, , , , ) into connID 
and Livecode doesn't find the file, Livecode creates the database.

just add:

Code:

put "DROP TABLE temp ; " into tSQL  # drop the scratch table when done
revExecuteSQL connID, tSQL
at the end of your function, and you don't need to delete any file.

Example:

Code:

on nodup
   put the text of field 1 into Temp
   put "./temp.sqlite" into tDatabasePath
   put revOpenDatabase("sqlite", tDatabasePath, , , , ) into connID
   put "CREATE TABLE temp (list TEXT) ; " into tSQL  # there are a lot of temp variables here... ;-)
   revExecuteSQL connID, tSQL
   put "INSERT INTO temp VALUES " into tSQL
   repeat for each line tempLine in Temp
      put " ('" & tempLine & "') ," after tSQL
   end repeat
   delete last char of tSQL  # this removes the last comma
   put " ; " after tSQL
   revExecuteSQL connID, tSQL
   put "SELECT DISTINCT list FROM temp ; " into tSQL
   put revDataFromQuery(tab, return, connID, tSQL) into tRecords
   put "DROP TABLE temp ; " into tSQL
   revExecuteSQL connID, tSQL
end nodup
This way you don't care if there is a DB or not. 8-)

Klaus
Posts: 14177
Joined: Sat Apr 08, 2006 8:41 am
Contact:

Re: Faster way to remove duplicate lines?

Post by Klaus » Thu Aug 01, 2013 1:42 pm

Hi Ed,

Sorry for chiming in a bit late.

Please try the ARRAY approach to this problem.
Tested with 10,000 (short) lines of example data (first_name name),
which contained 22 duplicates, and it took 19 millisecs :-)

Code:

on mouseUp
   put field 1 into Temp
   ## field 1 contains 10K lines in my example
   put empty into tArray
   put the milliseconds into timer

   ## We "abuse" the KEYs of that array for our purpose!
   repeat for each line theLine in Temp
      add 1 to tArray[theLine]
   end repeat
   put the keys of tArray into fld 2
   put (the milliseconds - timer)
end mouseUp
At least worth a try!

EDIT: Just tested with 30,000 lines and it took 22 millisecs :-)


Best

Klaus

sritcp
Posts: 431
Joined: Tue Jun 05, 2012 5:38 pm

Re: Faster way to remove duplicate lines?

Post by sritcp » Thu Aug 01, 2013 2:15 pm

edgore:

Am I correct in understanding that your array method will only work if the first word of each line (the "key") is unique for each unique line?
That is, if there are two non-duplicate lines that happen to share the same first word, your routine will delete all but the first such occurrence?

Sri.

edgore
VIP Livecode Opensource Backer
Posts: 197
Joined: Wed Jun 14, 2006 8:40 pm

Re: Faster way to remove duplicate lines?

Post by edgore » Thu Aug 01, 2013 4:44 pm

@sritcp
Just tested it with a field that contains

item 1
item 2
item 3
item 4
item 4
item 4
item 5

And you are right, it comes back with a single line, containing "item 5", so it is only working when the first word is unique. I knew there had to be a catch - that is what I get for playing with arrays when I don't really understand them.

It still works for what I was doing, but it's not a general solution. I need to look into it a little more and see what else I can come up with - maybe using an unusual character for the second delimiter? Something that is not going to show up in text generally? I also need to look at what Klaus just proposed and try to understand what it's doing.
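For what it's worth, that unusual-delimiter idea might look like this (an untested sketch; it assumes numToChar(1) never occurs in the data):

Code:

function deleteDuplicateLines theList
   put numToChar(1) into tSep -- a control character, unlikely in ordinary text
   split theList by return and tSep
   -- with no tSep in a line, the whole line becomes the key (value is empty)
   combine theList by return and tSep
   -- each line now has a trailing tSep; strip them
   replace tSep & return with return in theList
   delete the last char of theList
   return theList
end deleteDuplicateLines

Because the whole line becomes the key, lines that share a first word no longer collide.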
