-
theotherbassist
- Posts: 115
- Joined: Thu Mar 06, 2014 9:29 am
- Location: UK
Post
by theotherbassist » Thu Apr 07, 2016 11:40 pm
I'm using the following code to tally the number of times unique words appear in some text.
Code: Select all
on mouseUp
   createwordfrequencyanalysis field "list"
end mouseUp

on createwordfrequencyanalysis pTexts
   get listFrequencies(pTexts)
   put it into fld "freqlist"
end createwordfrequencyanalysis

function listFrequencies pText
   local tArray, tList, tTotal
   repeat for each trueWord tWord in pText
      add 1 to tArray[tWord] -- add 1 to the word count of each word
   end repeat
   repeat for each key tKey in tArray
      if length(tKey) < 4 then next repeat
      put tKey & comma & tArray[tKey] & cr after tList -- put each unique word and its frequency into tList
   end repeat
   sort tList descending numeric by item 2 of each
   return line 1 to -1 of tList
end listFrequencies
How can I modify this so that I'm tallying the number of unique *two word* phrases? Arrays calculate quickly, and I want to do it that way. I'd rather not use x and y variables with if statements to add a word to the subsequent word to form a string a hundred thousand times.
-
dunbarx
- VIP Livecode Opensource Backer
- Posts: 9648
- Joined: Wed May 06, 2009 2:28 pm
- Location: New York, NY
Post
by dunbarx » Fri Apr 08, 2016 4:43 am
Hi.
Someone more clever than I will undoubtedly come up with a solution along the lines you asked for.
I am the one who touts "a little analysis is worth a million lines of code", but after a little while I resorted to an ordinary brute-force method. Maybe it will be of some use to you. I randomly sprinkled "my dog has fleas" 2000 times in a block of text of 26000 words and put the result in field 3. Then, in a button:
Code: Select all
on mouseUp
   put fld 3 into tText
   put "dog" into firstWord
   put "has" into secondWord
   -- despite its name, this variable counts words: wordOffset skips words, not chars
   put 0 into charsToSkip
   repeat
      put wordOffset(firstWord,tText,charsToSkip) into foundNumber
      if foundNumber <> 0 then
         -- each offset is relative to the skipped words, so the running sum is the absolute word index
         put foundNumber & comma after accum
         put sum(accum) into charsToSkip
         put charsToSkip & comma after wordIndexes
      else exit repeat
   end repeat
   -- keep only the hits that are immediately followed by secondWord
   repeat for each item tItem in wordIndexes
      if word (tItem + 1) of tText = secondWord then put tItem & comma after foundList
   end repeat
   put foundList into fld "allThewordPairOffsets"
end mouseUp
It took nine seconds to find all the pairs, arranged as a list of the word offsets in the text. You could merely count the instances, of course, which is slightly easier.
-
theotherbassist
- Posts: 115
- Joined: Thu Mar 06, 2014 9:29 am
- Location: UK
Post
by theotherbassist » Fri Apr 08, 2016 9:34 am
That's similar to what I had been thinking I'd do if I couldn't figure out some way to do it with an array. I appreciate the code! My reservation about the "brute force" method is that I want to find the frequency of every single word pair in the field. When I do it for individual trueWords, it takes mere milliseconds.
I'm basically looking to display the most common keywords after I gather RSS news feed data from up to 50 feeds that average about 30 articles each at any given time. I've got all the XML legwork done, and the feeds import fine. The single-word keyword method is impressively quick, considering the massive amount of text that I throw into the field. Ultimately, the XML node reading takes a couple seconds per feed already, so I'm looking to cut down on any lengthy analytics so the whole app doesn't just become a waiting game.
-
dunbarx
- VIP Livecode Opensource Backer
- Posts: 9648
- Joined: Wed May 06, 2009 2:28 pm
- Location: New York, NY
Post
by dunbarx » Fri Apr 08, 2016 2:16 pm
Hi.
Check out a post I made in the "Feature Request" area, "Custom Chunk"
Craig
-
dunbarx
- VIP Livecode Opensource Backer
- Posts: 9648
- Joined: Wed May 06, 2009 2:28 pm
- Location: New York, NY
Post
by dunbarx » Fri Apr 08, 2016 5:34 pm
I am still in v. 6.7, but if you have read the thread in the "Feature Requests" pane, you will know that the newest itemDel is your friend. It can be the "custom chunk".
I am not sure in which version the itemDelimiter stopped being restricted to a single character and started accepting any string (I think this is right). If so, you can set the itemDel to two words and use itemOffset; the code will become much more compact, and you can run the array counter directly.
Craig
-
[-hh]
- VIP Livecode Opensource Backer
- Posts: 2262
- Joined: Thu Feb 28, 2013 11:52 pm
- Location: Göttingen, DE
Post
by [-hh] » Fri Apr 08, 2016 5:56 pm
... was wrong, sorry ...
Last edited by [-hh] on Mon Apr 11, 2016 9:12 pm, edited 3 times in total.
shiftLock happens
-
theotherbassist
- Posts: 115
- Joined: Thu Mar 06, 2014 9:29 am
- Location: UK
Post
by theotherbassist » Fri Apr 08, 2016 6:25 pm
Wow, this thread blew up all at once. I actually engineered a solution in the meantime involving functions that place ordered truewords into another variable and THEN into the array. I've got functions now that do this for up to four-word combinations. I'm on my phone now, but I'll post code soon. Good posts guys. Appreciate it.
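The approach described here (ordered trueWords collected into another variable, then counted in an array) might look roughly like the sketch below. This is not the poster's actual code, which was not included in the post; the handler name `listPhraseFrequencies` and the `pSize` parameter are hypothetical, chosen to mirror `listFrequencies` from the first post.

```livecode
-- Hypothetical helper (not the original poster's code): counts every run
-- of pSize consecutive trueWords in pText and returns "phrase,count" lines.
function listPhraseFrequencies pText, pSize
   local tWords, tCount, tArray, tPhrase, tList, tWord, tKey, i, j
   -- pass 1: store the trueWords in an indexed array so they can be addressed by number
   put 0 into tCount
   repeat for each trueWord tWord in pText
      add 1 to tCount
      put tWord into tWords[tCount]
   end repeat
   -- pass 2: slide a window of pSize words over the indexed words
   repeat with i = 1 to tCount - pSize + 1
      put tWords[i] into tPhrase
      repeat with j = 1 to pSize - 1
         put space & tWords[i + j] after tPhrase
      end repeat
      add 1 to tArray[tPhrase] -- tally this phrase
   end repeat
   -- build the "phrase,count" list, most frequent first
   repeat for each key tKey in tArray
      put tKey & comma & tArray[tKey] & cr after tList
   end repeat
   if tList is not empty then delete the last char of tList
   sort lines of tList descending numeric by item 2 of each
   return tList
end listPhraseFrequencies
```

Under these assumptions, `put listPhraseFrequencies(field "list", 2) into fld "freqlist"` would tally two-word phrases, and passing 3 or 4 would cover longer combinations. Both passes use `repeat for each` or plain array lookups, so the cost should stay close to that of the single-word version. Note that the window ignores sentence boundaries, so the last word of one sentence is paired with the first word of the next.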
-
FourthWorld
- VIP Livecode Opensource Backer
- Posts: 9823
- Joined: Sat Apr 08, 2006 7:05 am
- Location: Los Angeles
Post
by FourthWorld » Fri Apr 08, 2016 7:35 pm
-hh wrote: kernel[0]: process LiveCode-Indy[846] thread 82336 caught burning CPU! It used more than 50% CPU (Actual recent usage: 91%) over 180 seconds. thread lifetime cpu usage 90.046699 seconds, (89.323192 user, 0.723507 system) ledger info: balance: 90005959970 credit: 90010116731 debit: 4156761 limit: 90000000000 (50%) period: 180000000000 time since last refill (ns): 97917706200
ReportCrash[857]: Invoking spindump for pid=852 thread=83578 percent_cpu=73 duration=124 because of excessive cpu utilization
Did you file a bug report on that? I'm sure the team would want to fix that if they have the crash info and the sample code.
-
[-hh]
- VIP Livecode Opensource Backer
- Posts: 2262
- Joined: Thu Feb 28, 2013 11:52 pm
- Location: Göttingen, DE
Post
by [-hh] » Fri Apr 08, 2016 9:31 pm
... There is a bug, but my logic was wrong, sorry ...
Last edited by [-hh] on Mon Apr 11, 2016 9:11 pm, edited 1 time in total.
shiftLock happens
-
Mark
- Livecode Opensource Backer
- Posts: 5150
- Joined: Thu Feb 23, 2006 9:24 pm
Post
by Mark » Sat Apr 09, 2016 8:41 am
I made a text analysis tool, a long time ago. Here's how I count the words:
Code: Select all
repeat for each word myWord in myText
   if myArray[myWord] is empty then
      put 1 into myArray[myWord]
   else
      add 1 to myArray[myWord]
   end if
end repeat
Before starting the count, you might want to remove all non-alphanumeric characters except for white space. You could also compare this with the other solutions given above and see which is faster.
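The clean-up step mentioned above could be sketched with `replaceText`, which takes a regular expression. The handler name `stripPunctuation` is made up for illustration, and the pattern assumes plain ASCII text; accented letters would be stripped as well.

```livecode
-- Hypothetical helper: removes everything except ASCII letters, digits and white space.
function stripPunctuation pText
   -- [^A-Za-z0-9\s] matches any single char that is not a letter, a digit or white space
   return replaceText(pText, "[^A-Za-z0-9\s]", empty)
end stripPunctuation
```

It would be called as `put stripPunctuation(myText) into myText` before the counting loop. Removing a character outright joins its neighbours ("dog,cat" becomes "dogcat" and "don't" becomes "dont"), so replacing with `space` instead of `empty` may suit text where punctuation is not always followed by a space.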
Kind regards,
Mark
The biggest LiveCode group on Facebook: https://www.facebook.com/groups/livecode.developers
The book "Programming LiveCode for the Real Beginner"! Get it here! http://tinyurl.com/book-livecode
-
[-hh]
- VIP Livecode Opensource Backer
- Posts: 2262
- Joined: Thu Feb 28, 2013 11:52 pm
- Location: Göttingen, DE
Post
by [-hh] » Sat Apr 09, 2016 12:55 pm
... was wrong, sorry ...
Last edited by [-hh] on Mon Apr 11, 2016 9:10 pm, edited 1 time in total.
shiftLock happens
-
TerryL
- Posts: 78
- Joined: Sat Nov 23, 2013 8:57 pm
Post
by TerryL » Sat Apr 09, 2016 6:36 pm
Filter would be another fast method, especially for displaying the lines that contain the search string, but it wouldn't accurately count occurrences if the search string appeared more than once per line. For that, first filter, then use the offset() routine to count occurrences within the smaller subset of lines. The new itemDel string and number(items) seem best for versions above 6. Terry
Code: Select all
on mouseUp --keep lines containing search string
   put fld "A" into tTemp
   filter tTemp with ("*"& "my dog" &"*") --lines containing
   --filter tTemp with ("my dog" &"*") --lines beginning with
   --filter tTemp with ("my dog") --lines only of
   put number(lines in tTemp) &&"lines" &cr& tTemp
end mouseUp
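The second step, counting with offset() once the lines have been filtered, could be sketched as follows. The function name `countOccurrences` is hypothetical; it relies only on the built-in `offset(stringToFind, stringToSearch, charsToSkip)`, which returns a position relative to the skipped characters, or 0 when nothing is found.

```livecode
-- Hypothetical helper: counts non-overlapping occurrences of pPhrase in pText.
function countOccurrences pPhrase, pText
   local tCount, tSkip, tFound
   put 0 into tCount
   put 0 into tSkip
   if pPhrase is empty then return 0
   repeat
      put offset(pPhrase, pText, tSkip) into tFound
      if tFound = 0 then exit repeat -- no further match
      add 1 to tCount
      -- move the skip point to the last char of this match
      add tFound + length(pPhrase) - 1 to tSkip
   end repeat
   return tCount
end countOccurrences
```

After the filter above, `put countOccurrences("my dog", tTemp)` would report the total across the remaining lines. Because offset() matches substrings, "my dogma" would also be counted; a whole-word check would need extra logic.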
Beginner Lab (LiveCode tutorial) and StarterKit (my public stacks)
https://tlittle72.neocities.org/info.html
-
Mark
- Livecode Opensource Backer
- Posts: 5150
- Joined: Thu Feb 23, 2006 9:24 pm
Post
by Mark » Sat Apr 09, 2016 10:23 pm
hh,
Why do you use 3 repeat loops and even a nested repeat loop? That seems superfluous to me.
Mark
-
dunbarx
- VIP Livecode Opensource Backer
- Posts: 9648
- Joined: Wed May 06, 2009 2:28 pm
- Location: New York, NY
Post
by dunbarx » Sat Apr 09, 2016 11:17 pm
If all the OP wanted to do was count two-word phrases, and not find their offsets as well, then why not just (in v.7 and above):
Code: Select all
set the itemDel to "has fleas"
answer the number of items of fld "yourField" - 1
Rereading the very first post, this is all that was asked for. The new ItemDel can do much, freed from a single char.
Craig
-
[-hh]
- VIP Livecode Opensource Backer
- Posts: 2262
- Joined: Thu Feb 28, 2013 11:52 pm
- Location: Göttingen, DE
Post
by [-hh] » Sat Apr 09, 2016 11:43 pm
... was wrong, sorry ...
Last edited by [-hh] on Mon Apr 11, 2016 9:09 pm, edited 1 time in total.
shiftLock happens