Duplicate Remover

deeverd
VIP Livecode Opensource Backer
Posts: 165
Joined: Mon Jul 16, 2007 11:58 pm
Location: Las Vegas, NV

Post by deeverd » Wed Jan 16, 2008 6:48 pm

Hello Everyone,

Yesterday, I happened to find a fascinating piece of code written by Marielle in a forum posting some time ago. She mentioned it as being perhaps the fastest script she had found to remove duplicates from a list. I can see that it is a custom function that uses an array, and I've experimented with it for some time but don't know how to get it to work.

Here are some of the things I'm a bit confused about:
I see the function name is "list.deleteDuplicates" so I was wondering if the "." between "list" and "delete" is supposed to be there?

Also, I'm not sure how one would call this function. For instance, would you have a mouseDown handler that puts your list into an array called "list.deleteDuplicates" first or what?

Code: Select all

function list.deleteDuplicates pList 
  split pList by cr and space 
  combine pList by cr and space 
  return pList 
end list.deleteDuplicates

I have made a number of experimental programs that will delete duplicate words from a list but they're a bit too slow when the list consists of tens of thousands of words. If anyone could briefly explain the above script to me or show the script that would call this handler into action, it would be greatly appreciated.

Basically my big lists of words are in a single column, in alphabetical order, one word per line separated by a CR, but I know that I have many repeated words in those databases that are supposed to have only one of each word, and I have quite a few different databases to clean up, so a fast way to do this would be a fantastic asset.

As always, thanks in advance for all the knowledge this forum imparts.
All the best, deeverd

FourthWorld
VIP Livecode Opensource Backer
Posts: 9867
Joined: Sat Apr 08, 2006 7:05 am
Location: Los Angeles

Post by FourthWorld » Wed Jan 16, 2008 7:35 pm

Dot notation is commonly used in object-oriented systems, usually to denote methods or properties of a class. Since Rev is not an OOP language I'm not sure why the dot is present there, but dots have no harmful side-effect that I'm aware of.

As for the algorithm itself, the split command turns a return-delimited list into a set of array elements. Arrays are like a bookshelf, in which each shelf has a name. You can put whatever you want into a shelf, but since you must refer to that shelf by name there can only be one shelf with any given name.

So if your list is:

Bob
Carol
Carol
Ted

...then the split command is effectively doing this:

make an array element named "Bob"
make an array element named "Carol"
make an array element named "Carol"
make an array element named "Ted"

The second and third statements do the same thing, so the third has no effect, with the result being an array with three elements.

The combine command takes the various "shelves" of an array and puts them into a return-delimited list. Since the array contains three items, the resulting list will be:

Bob
Carol
Ted
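
The bookshelf description above can also be spelled out as an explicit loop. What follows is only a sketch of the same idea, not Marielle's function: the handler name uniqueLines and the field name "wordList" are made up for illustration. It also shows one way such a function could be called from a button.

```livecode
on mouseUp
   -- "wordList" is a hypothetical field holding one word per line
   put uniqueLines(field "wordList") into field "wordList"
end mouseUp

function uniqueLines pList
   local tSeen, tResult, tLine
   repeat for each line tLine in pList
      if tLine is empty then next repeat -- skip blank lines
      if tSeen[tLine] is empty then
         -- first time this "shelf" name appears: remember it and keep the line
         put true into tSeen[tLine]
         put tLine & cr after tResult
      end if
   end repeat
   delete char -1 of tResult -- drop the trailing return
   return tResult
end uniqueLines
```

Because the loop appends each line the first time it is seen, the result keeps the original order of the list, which the split and combine approach is not guaranteed to do.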


Did that help?
Richard Gaskin

deeverd

Post by deeverd » Wed Jan 16, 2008 10:48 pm

Hi Richard,

I just grabbed and printed this at work and will have time to digest it while sitting at the eye doctor's waiting for my wife's appointment. I'll let you know if it all sinks in OK, plus I'll have a chance to experiment with it later today. In the meantime, thanks for the response.

Actually, I remember some excellent MetaCard array advice you sent in the summer that is definitely the fastest way I've ever found to get a word count of the individual words in an array. Something I found really interesting about it was that when a formatted MS Word text was imported, it even gave an instant return of all the formatting marks, all the 26 letters of the alphabet, etc. Of course, when a list was imported, it didn't do that and worked as expected, but still, those represent some valuable pieces of info, too.

Anyway, a couple questions I have about that code, which I'll paste in here below to remind you, would be, "Is there a way to modify that super fast script so that it would compare a database list of words to the words it imports from a text or Word file?" If so, would a repeat structure be the fastest way to accomplish this task or is there a better way? I've created a number of ways of doing this but it always takes too long because I have some database lists of words that are 14,000 words long that are being compared to a 120,000 word manuscript.

Also, would it speed things up if there was a way to just get a tally of the number of matches, without getting feedback on exactly what words in the database matched with the words from the text? For the most part, I only need the total number of matches from each database without all the details.

Here's the incredibly fast code I received from you:

Code: Select all

on mouseUp
   put empty into field "result"
   answer file "Select a text file for input:"
   if it is empty then exit mouseUp
   # let user know we're working on it
   set the cursor to watch
   put it into inputFile
   open file inputFile for read
   read from file inputFile until eof
   put it into fileContent
   close file inputFile
   # wordCount is an associative array; its indexes are words,
   # with the contents of each element being the number of times
   # that word appears
   repeat for each word w in fileContent
      add 1 to wordCount[w]
   end repeat
   # copy all the indexes that are in the wordCount associative array
   put keys(wordCount) into keyWords
   # sort the indexes -- keyWords contains a list of elements in array
   sort keyWords
   repeat for each line l in keyWords
      put l & tab & wordCount[l] & return after displayResult
   end repeat
   put displayResult into field "result"
end mouseUp

Thanks so much again, deeverd

deeverd

Post by deeverd » Thu Jan 17, 2008 3:58 pm

Hi Richard,

Please hold off on your answer for a bit. Last night I had a chance to read over your advice, plus I cut and pasted together every bit of array information on this forum that I could find.

After reading your reply, I was a bit embarrassed to discover that Marielle's code I was looking at was really rather simple. It was the "list.deleteDuplicates" that I was reading more into than was really there. I thought the "." in that part of the script was some sort of magic I hadn't discovered yet, when in fact it was only part of the function's name.

Anyway, I think I actually understand all this fairly well now, which is about time since I've been "code cannibalizing" array scripts for over 4 months now. It's nice when the lights finally come on. In fact, when I get my own match text array code put together, hopefully over the next couple days, I'll post it here for anyone else to use.

There is one thing, however, I am still wondering how to do and that is to create a "numbers only" match text array that just gives me feedback on the numbers of matches the array text made with the array database, without receiving a report on the individual words, since I'm assuming that might make things even faster.

All the best, deeverd

deeverd

Post by deeverd » Thu Jan 17, 2008 8:06 pm

Hello All,

I think I may have just found the right command to get a numbers-only list of matches in a very fast way:

I just found info on the "intersect" command for arrays. In that case, it stands to reason that all I have to do is make a word count on my text before turning it into an array, and then do a word count on that recombined text after intersecting it with a database array.

The difference between the beginning and the ending word count of the text should be the number of matches it made with the database field... if my logic is correct.

Now all I have to do is to figure out exactly how to intersect two arrays correctly... so it's back to the manuals.
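
For reference, a hedged sketch of how the intersect idea might look. The intersect command keeps only the elements of the first array whose keys also exist in the second, so the number of keys left afterwards is already the number of distinct matches, and no before-and-after subtraction should be needed. The handler name countMatches and all variable names are invented here. Words are compared exactly as the word chunk delivers them, so a word with attached punctuation ("cat," versus "cat") would count as a different word.

```livecode
function countMatches pText, pWordList
   local tTextWords, tDatabase, tWord, tKeys
   -- one array element per distinct word in the manuscript
   repeat for each word tWord in pText
      add 1 to tTextWords[tWord]
   end repeat
   -- one array element per database word (one word per line assumed)
   repeat for each line tWord in pWordList
      if tWord is not empty then put true into tDatabase[tWord]
   end repeat
   -- keep only the manuscript words that are also keys of the database array
   intersect tTextWords with tDatabase
   put the keys of tTextWords into tKeys
   return the number of lines in tKeys
end countMatches
```

This returns the count of distinct database words found in the text. If the total number of occurrences is wanted instead, the values remaining in tTextWords could be summed in a short loop over tKeys.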

Cheers, deeverd

andyh1234
Posts: 442
Joined: Mon Aug 13, 2007 4:44 pm
Location: Eccles UK

Post by andyh1234 » Tue Feb 10, 2009 4:25 pm

I've been trying to figure out a slightly more difficult duplicate sort without much luck. Basically I have a variable in which each line has 10 entries separated by tabs, and I just need to check the first two items for duplication.

The code I have at the moment that works, albeit slowly, is simply:

Code: Select all

      
repeat with v = (the number of lines in tList) down to 2
   if (item 1 of line v of tList = item 1 of line (v - 1) of tList) and (item 2 of line v of tList = item 2 of line (v - 1) of tList) then
      -- duplicate found
      delete line v of tList
   end if
end repeat

It's fine as long as I keep the number of lines in tList under 300. There must be a better (faster) way to do this. Any ideas that could point me in the right direction?

Thanks

Andy

Mark
Livecode Opensource Backer
Posts: 5150
Joined: Thu Feb 23, 2006 9:24 pm

Post by Mark » Tue Feb 10, 2009 6:23 pm

Hi Andy,

Are you looking for something like this?

Code: Select all

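function removeDups theOldList
  -- Andy's data is tab-delimited; the itemDelimiter defaults to comma in each handler
  set the itemDelimiter to tab
  put cr into myNewList
  repeat for each line myLine in theOldList
    -- the whole "is in" test is parenthesized so that "not" applies to its result
    if not ((cr & item 1 to 2 of myLine & the itemDelimiter) is in myNewList) then
      put myLine & cr after myNewList
    end if
  end repeat
  return char 2 to -2 of myNewList
end removeDups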
(Untested, please beware fo typoz)
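
For longer lists, the same test could be done with an array lookup instead of searching a growing string. This is only a sketch, assuming the data is tab-delimited as Andy describes; removeDupsByKey is a made-up name. Note that it removes every later line whose first two items have been seen before, not just adjacent ones, so the list would not need to be sorted first.

```livecode
function removeDupsByKey pList
   local tSeen, tKept, tLine, tKey
   set the itemDelimiter to tab -- assumption: tab-delimited columns
   repeat for each line tLine in pList
      -- the first two columns define a duplicate;
      -- the "#" prefix only guarantees that the key is never empty
      put "#" & item 1 to 2 of tLine into tKey
      if tSeen[tKey] is empty then
         put true into tSeen[tKey]
         put tLine & cr after tKept
      end if
   end repeat
   delete char -1 of tKept -- drop the trailing return
   return tKept
end removeDupsByKey
```

It would be called the same way as removeDups, for example: put removeDupsByKey(tList) into tList.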

Mark

andyh1234

Post by andyh1234 » Wed Feb 11, 2009 2:20 pm

Thanks Mark, that's just what I was looking for. I just couldn't get a handle on where to start!

Andy

kotikoti
Posts: 72
Joined: Tue Apr 15, 2008 7:35 pm

Post by kotikoti » Sat Jun 13, 2009 2:19 pm

Hi all,
I have tried to use the above code but can't figure out what it is doing.
I need to clean a field containing data as follows

1
1
8
8
1
8
7
1
8
9

to remove duplicates. Any assistance will be appreciated
Build 160
Version 2.9.0
Platform: Windows

bn
VIP Livecode Opensource Backer
Posts: 4036
Joined: Sun Jan 07, 2007 9:12 pm
Location: Bochum, Germany

Post by bn » Sat Jun 13, 2009 3:27 pm

Kotikoti,

Mark's function was for a different situation. I adapted Mark's solution to your problem:

Code: Select all

on mouseUp
   put removeDups(field "myField") into field "myField"
   -- optionally sort the field numeric
   sort field "myField" numeric
end mouseUp


function removeDups theOldList
   put cr into myNewList
   repeat for each line myLine in theOldList
      -- compare whole lines (cr on both sides) so that "1" does not match inside "11"
      if not ((cr & myLine & cr) is in myNewList) then
         put myLine & cr after myNewList
      end if
   end repeat
   return char 2 to -2 of myNewList
end removeDups

This goes into a button; adjust the field name to match yours.
regards
Bernd

kotikoti

Post by kotikoti » Sat Jun 13, 2009 5:40 pm

Thanks, bn. I will have a go with the suggested code.

trevix
Posts: 971
Joined: Sat Feb 24, 2007 11:25 pm
Location: Italy

Post by trevix » Tue May 02, 2023 4:05 pm

I have a similar small problem which I find hard to solve:

tList contains
AA1_AA2
AA2_AA1
tSearch is "AA"
I need to limit the list to lines matching the search and remove duplicates (done), but also remove lines where the "_"-delimited items are just inverted. That is, I want to end up with
AA1_AA2 or AA2_AA1, but not both.

Code: Select all

filter tList with ("*" & tSearch & "*") into tList1
if tList1 is not empty then
   -- delete duplicates
   split tList1 by cr and "" -- not using space because it adds a space to the names in the combine
   combine tList1 by cr and ""
   -- here I should try to remove inverted duplicates
end if

I tried different approaches, to no avail. For example:

Code: Select all

put tList1 into tList2
repeat for each line tLine in tList2
     put item 2 of tLine & "_" & item 1 of tLine into tLine --invert
     put lineOffset(tLine, tList1 ) into tNum
     delete line tNum of tList1
     delete line tNum of tList2 --here something doesn't fit because the "repeat" will still do another reiteration
end repeat
Any idea?

Thanks
Trevix
OSX 14.3.1 xCode 15 LC 10 DP7 iOS 15> Android 7>

rkriesel
VIP Livecode Opensource Backer
Posts: 119
Joined: Thu Apr 13, 2006 6:25 pm

Post by rkriesel » Wed May 03, 2023 3:02 am

trevix wrote:
Tue May 02, 2023 4:05 pm
I need to limit the list to the search and removing duplicates (done), but also removing lines where the "_" delimited items are just inverted.
Hi, Trevix. Here's a technique that prevents duplicates and reversals in data like yours. And it avoids repeatedly invoking lineOffset.

Code: Select all

on test
   local t, k, a
   put "a,b" & cr after t
   put "a,b" & cr after t
   put "b,a" & cr after t
   
   repeat for each line l in t
      put true into a[l] -- to prevent duplicates
      get item 2 of l, item 1 of l -- reversal
      if not a[it] then -- if the reversal is a new one, then ...
         put l & cr after k -- to keep the line
         put true into a[it] -- to prevent reversals
      end if
   end repeat
   
   breakpoint -- see results in k for keepers
end test
Does it work for you?
-- Dick
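
A variation adapted to the "_"-delimited lines from the question, offered as a sketch only. It uses a different technique from the handler above: each line gets an order-independent key in which the item that compares lower always comes first, so AA1_AA2 and AA2_AA1 both map to the key AA1_AA2 and only the first of them is kept. The name removeReversals is made up, and the sketch assumes each line holds two plain-text names like those in the example.

```livecode
function removeReversals pList
   local tSeen, tKept, tLine, tKey
   set the itemDelimiter to "_"
   repeat for each line tLine in pList
      if tLine is empty then next repeat -- skip blank lines
      -- order-independent key: the item that compares lower goes first
      if item 1 of tLine <= item 2 of tLine then
         put item 1 of tLine & "_" & item 2 of tLine into tKey
      else
         put item 2 of tLine & "_" & item 1 of tLine into tKey
      end if
      if tSeen[tKey] is empty then
         -- neither this pair nor its reversal has been seen yet
         put true into tSeen[tKey]
         put tLine & cr after tKept
      end if
   end repeat
   delete char -1 of tKept -- drop the trailing return
   return tKept
end removeReversals
```

Since exact duplicates also share a key, this handles plain duplicates and reversals in one pass; it could be applied to the filtered list, for example: put removeReversals(tList1) into tList1.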

trevix

Post by trevix » Wed May 03, 2023 11:15 am

MAGIC!!
Thanks a lot. Sometimes I get stuck on these little things, and it makes me think that I may not be so well versed in logic.
Oh well, programming is so much fun...
Trevix

trevix

Post by trevix » Sun May 07, 2023 9:36 pm

Interesting:
removing duplicates from a list using

Code: Select all

split pList by cr and ""
combine pList by cr and ""

does not remove duplicates that differ only in capitalization, even when the caseSensitive is set to false before the code runs.
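
One possible workaround that does not depend on how split treats case, sketched with a made-up handler name: build the array keys from toLower() of each line, so that "Apple" and "apple" share one key. The first spelling encountered is the one that is kept.

```livecode
function uniqueLinesIgnoringCase pList
   local tSeen, tKept, tLine, tKey
   repeat for each line tLine in pList
      if tLine is empty then next repeat -- skip blank lines
      -- lower-case copy used only as the lookup key
      put toLower(tLine) into tKey
      if tSeen[tKey] is empty then
         put true into tSeen[tKey]
         put tLine & cr after tKept -- keep the original spelling
      end if
   end repeat
   delete char -1 of tKept -- drop the trailing return
   return tKept
end uniqueLinesIgnoringCase
```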
Trevix
