## Count # of Unique Two-Word Phrases (Strings/Items)

theotherbassist
Posts: 114
Joined: Thu Mar 06, 2014 9:29 am
Location: UK

### Count # of Unique Two-Word Phrases (Strings/Items)

I'm using the following code to tally the number of times unique words appear in some text.

```
on mouseUp
   createwordfrequencyanalysis field "list"
end mouseUp

on createwordfrequencyanalysis pTexts
   get listFrequencies(pTexts)
   put it into fld "freqlist"
end createwordfrequencyanalysis

function listFrequencies pText
   local tArray, tList, tTotal
   repeat for each trueWord tWord in pText
      add 1 to tArray[tWord] -- add 1 to the word count of each word
   end repeat
   repeat for each key tKey in tArray
      if length(tKey) < 4 then next repeat
      put tKey & comma & tArray[tKey] & cr after tList -- put each unique word and its frequency into tList
   end repeat
   sort tList descending numeric by item 2 of each
   return line 1 to -1 of tList
end listFrequencies
```

How can I modify this so that I'm tallying the number of unique *two word* phrases? Arrays calculate quickly, and I want to do it that way. I'd rather not use x and y variables with if statements to add a word to the subsequent word to form a string a hundred thousand times.

dunbarx
VIP Livecode Opensource Backer
Posts: 6462
Joined: Wed May 06, 2009 2:28 pm
Location: New York, NY

### Re: Count # of Unique Two-Word Phrases (Strings/Items)

Hi.

Someone more clever than I will undoubtedly come up with a solution along the lines you asked for.

I am the one who touts "a little analysis is worth a million lines of code," but after a little while I resorted to an ordinary brute-force method. Maybe it will be of some use to you. I randomly sprinkled "my dog has fleas" 2000 times in a block of text of 26000 words. The text is in field 3; this goes in a button:

```
on mouseUp
   put fld 3 into tText
   put "dog" into firstWord
   put "has" into secondWord
   put 0 into charsToSkip
   repeat
      put wordOffset(firstWord,tText,charsToSkip) into foundNumber
      if foundNumber <> 0 then
         put foundNumber & comma after accum
         put sum(accum) into charsToSkip
         put charsToSkip & comma after wordIndexes
      else exit repeat
   end repeat

   repeat for each item tItem in wordIndexes
      if word (tItem + 1) of tText = secondWord then put tItem & comma after foundList
   end repeat
   put foundList into fld "allThewordPairOffsets"
end mouseUp
```

It took nine seconds to find all the pairs, returned as a list of their word offsets in the text. You could simply count the instances instead, of course, which is slightly easier.

theotherbassist

### Re: Count # of Unique Two-Word Phrases (Strings/Items)

That's similar to what I had been thinking I'd do if I couldn't figure out some way to do it with an array. I appreciate the code! My reservation about going the "brute force" method is that I want to find the freq of every single word pair in the field. When I do it for individual truewords, it takes mere milliseconds.

I'm basically looking to display the most common keywords after I gather RSS news feed data from up to 50 feeds that average about 30 articles each at any given time. I've got all the XML legwork done, and the feeds import fine. The single-word keyword method is impressively quick, considering the massive amount of text that I throw into the field. Ultimately, the XML node reading takes a couple seconds per feed already, so I'm looking to cut down on any lengthy analytics so the whole app doesn't just become a waiting game.

dunbarx

### Re: Count # of Unique Two-Word Phrases (Strings/Items)

Hi.

Check out a post I made in the "Feature Request" area, "Custom Chunk".

Craig

dunbarx

### Re: Count # of Unique Two-Word Phrases (Strings/Items)

I am still in v. 6.7, but if you have read the thread in the "Feature Requests" pane, the newest itemDel is your friend. It can be the "custom chunk".

I am not sure in which version it changed, but the itemDelimiter is no longer restricted to a single character and can be any string (I think this is right). So if you set the itemDel to two words and use itemOffset, the code will become much more compact, and you can run the array counter directly.

Craig

[-hh]
VIP Livecode Opensource Backer
Posts: 2234
Joined: Thu Feb 28, 2013 11:52 pm
Location: Göttingen, DE

### Re: Count # of Unique Two-Word Phrases (Strings/Items)

... was wrong, sorry ...
Last edited by [-hh] on Mon Apr 11, 2016 9:12 pm, edited 3 times in total.
shiftLock happens

theotherbassist

### Re: Count # of Unique Two-Word Phrases (Strings/Items)

Wow, this thread blew up all at once. I actually engineered a solution in the meantime involving functions that place ordered truewords into another variable and THEN into the array. I've got functions now that do this for up to four-word combinations. I'm on my phone now, but I'll post code soon. Good posts guys. Appreciate it.
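
A minimal sketch of one way to do what this post describes, not the poster's actual code (none had been posted at this point): remember the previous trueWord and use the two-word string as the array key. The handler name `listPairFrequencies` and the variable `tPrev` are hypothetical, and `trueWord` assumes LiveCode 7 or later.

```livecode
-- Hypothetical sketch: tally adjacent two-word phrases in an array.
function listPairFrequencies pText
   local tArray, tList, tPrev
   put empty into tPrev
   repeat for each trueWord tWord in pText
      -- from the second word on, count the pair "previous current"
      if tPrev is not empty then
         add 1 to tArray[tPrev && tWord]
      end if
      put tWord into tPrev
   end repeat
   repeat for each key tKey in tArray
      put tKey & comma & tArray[tKey] & cr after tList
   end repeat
   -- the count is always the last item, even if a word contains a comma
   sort lines of tList descending numeric by item -1 of each
   return tList
end listPairFrequencies
```

Something like `put listPairFrequencies(field "list") into fld "freqlist"` would mirror the single-word handler from the first post. Pairs are formed across sentence and line boundaries too, which may or may not be wanted.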

FourthWorld
VIP Livecode Opensource Backer
Posts: 7270
Joined: Sat Apr 08, 2006 7:05 am
Location: Los Angeles
Contact:

### Re: Count # of Unique Two-Word Phrases (Strings/Items)

-hh wrote: kernel[0]: process LiveCode-Indy[846] thread 82336 caught burning CPU! It used more than 50% CPU (Actual recent usage: 91%) over 180 seconds. thread lifetime cpu usage 90.046699 seconds, (89.323192 user, 0.723507 system) ledger info: balance: 90005959970 credit: 90010116731 debit: 4156761 limit: 90000000000 (50%) period: 180000000000 time since last refill (ns): 97917706200
ReportCrash[857]: Invoking spindump for pid=852 thread=83578 percent_cpu=73 duration=124 because of excessive cpu utilization
Did you file a bug report on that? I'm sure the team would want to fix that if they have the crash info and the sample code.
Community volunteer LiveCode Community Liaison

LiveCode development, training, and consulting services: Fourth World Systems: http://FourthWorld.com

[-hh]

### Re: Count # of Unique Two-Word Phrases (Strings/Items)

... There is a bug, but my logic was wrong, sorry ...
Last edited by [-hh] on Mon Apr 11, 2016 9:11 pm, edited 1 time in total.

Mark
Livecode Opensource Backer
Posts: 5142
Joined: Thu Feb 23, 2006 9:24 pm
Contact:

### Re: Count # of Unique Two-Word Phrases (Strings/Items)

I made a text analysis tool, a long time ago. Here's how I count the words:

```
repeat for each word myWord in myText
   if myArray[myWord] is empty then
      put 1 into myArray[myWord]
   else
      add 1 to myArray[myWord] -- word seen before: increment its count
   end if
end repeat
```

Before starting the count, you might want to remove all non-alphanumeric characters except for white space. You could also compare this with the other solutions given above and see which is faster.

Kind regards,

Mark
The book "Programming LiveCode for the Real Beginner"! Get it here! http://tinyurl.com/book-livecode
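
A rough sketch of the clean-up step suggested in this post, assuming plain ASCII text. The handler name `stripPunctuation` is hypothetical; `replaceText` is LiveCode's regular-expression replace function.

```livecode
-- Hypothetical sketch: drop everything except ASCII letters, digits and whitespace.
function stripPunctuation pText
   -- the character class matches any single char that is NOT a letter, digit or whitespace
   return replaceText(pText, "[^A-Za-z0-9\s]", empty)
end stripPunctuation
```

Deleting characters outright joins hyphenated words and removes apostrophes ("don't" becomes "dont"), and the ASCII-only class would also strip accented letters. Replacing with `space` instead of `empty` is an alternative if that matters for the feed text.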

[-hh]

### Re: Count # of Unique Two-Word Phrases (Strings/Items)

... was wrong, sorry ...
Last edited by [-hh] on Mon Apr 11, 2016 9:10 pm, edited 1 time in total.

TerryL
Posts: 69
Joined: Sat Nov 23, 2013 8:57 pm

### Re: Count # of Unique Two-Word Phrases (Strings/Items)

Filter would be another fast method, especially to display the lines that contain the search string, but it wouldn't accurately count occurrences if the search string appeared more than once per line. For that, first filter, then use the offset() routine to count occurrences within the smaller subset of lines. The new itemDel string and number(items) seem best for versions above v6. Terry

```
on mouseUp  --keep lines containing search string
   put fld "A" into tTemp
   filter tTemp with ("*"& "my dog" &"*")  --lines containing
   --filter tTemp with ("my dog" &"*")  --lines beginning with
   --filter tTemp with ("my dog")  --lines only of
   put number(lines in tTemp) &&"lines" &cr& tTemp
end mouseUp
```
Beginner Lab (LiveCode tutorial) and StarterKit (my public stacks)
https://tlittle72.neocities.org/info.html#26Anchor
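
The offset() counting step mentioned in this post could look roughly like the sketch below. The handler name `countOccurrences` and its variables are hypothetical. It counts non-overlapping matches by skipping past each hit, and it could be run on the filtered text rather than the whole field.

```livecode
-- Hypothetical sketch: count non-overlapping occurrences of pNeedle in pText.
function countOccurrences pNeedle, pText
   local tCount, tSkip, tFound
   if pNeedle is empty then return 0
   put 0 into tCount
   put 0 into tSkip
   repeat forever
      -- offset() returns the position relative to the skipped chars, or 0 if not found
      put offset(pNeedle, pText, tSkip) into tFound
      if tFound = 0 then exit repeat
      add 1 to tCount
      -- move the skip point to the last char of this match
      add tFound + length(pNeedle) - 1 to tSkip
   end repeat
   return tCount
end countOccurrences
```

For example, after the `filter` line above, `put countOccurrences("my dog", tTemp)` would give the total number of hits rather than the number of lines.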

Mark

### Re: Count # of Unique Two-Word Phrases (Strings/Items)

hh,

Why do you use 3 repeat loops and even a nested repeat loop? That seems superfluous to me.

Mark

dunbarx

### Re: Count # of Unique Two-Word Phrases (Strings/Items)

If all the OP wanted to do was count two-word phrases, and not find their offsets as well, then why not just (in v. 7 and above):

```
set the itemDel to "has fleas"
answer the number of items of fld "yourField" - 1
```

Rereading the very first post, this is all that was asked for. The new ItemDel can do much, freed from a single char.

Craig
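
One edge case with the item-count trick: LiveCode ignores a single trailing delimiter, so when the text ends with the phrase itself, "number of items - 1" comes out one short. A hedged sketch that accounts for this, assuming LiveCode 7 or later for the multi-character itemDelimiter (the handler name `countPhrase` is hypothetical):

```livecode
-- Hypothetical sketch: count occurrences of pPhrase using a multi-char itemDelimiter.
function countPhrase pPhrase, pText
   local tCount
   if pText is empty or pPhrase is empty then return 0
   set the itemDelimiter to pPhrase
   put the number of items of pText into tCount
   -- a trailing delimiter does not create an extra item, so only
   -- subtract 1 when the text does NOT end with the phrase
   if not (pText ends with pPhrase) then subtract 1 from tCount
   return tCount
end countPhrase
```

Usage would be along the lines of `answer countPhrase("has fleas", fld "yourField")`. Like the one-liner above, this counts one known phrase; it does not tally every distinct pair in the text.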

[-hh]

### Re: Count # of Unique Two-Word Phrases (Strings/Items)

... was wrong, sorry ...
Last edited by [-hh] on Mon Apr 11, 2016 9:09 pm, edited 1 time in total.