Re: Count # of Unique Two-Word Phrases (Strings/Items)
Posted: Sun Apr 10, 2016 12:11 am
... was wrong, sorry ...
Questions and answers about the LiveCode platform.
https://forums.livecode.com/
Code:
put trueword m of pText & space & trueword (m + 1) of pText & space & trueword (m + 2) of pText & space & trueword (m + 3) of pText & comma & comma after twoWordList
Code:
global twoWordList

on mouseUp
   makewordList field "list"
   createwordfrequencyanalysis twoWordList
   replace ",," with "," in fld "freqlist"
   put empty into twoWordList
end mouseUp

on makewordList pTexts
   get make2wordList(pTexts)
   put it into fld "freqlist"
end makewordList

function TwoWordLinesplit pText
   put 1 into m
   repeat until trueword (m + 3) of pText is empty
      put trueword m of pText & space & trueword (m + 1) of pText & space & trueword (m + 2) of pText & space & trueword (m + 3) of pText & comma & comma after twoWordList
      add 1 to m
   end repeat
end TwoWordLinesplit

function make2wordList pText
   put 1 into L
   put empty into twoWordList
   repeat until line L of (pText) is empty
      get twoWordLineSplit (line L of (pText))
      add 2 to L -- steps by 2, so only every other line is processed
   end repeat
end make2wordList

on createwordfrequencyanalysis pTexts
   get listFrequencies(pTexts)
   put it into fld "freqlist"
end createwordfrequencyanalysis

function listFrequencies pText
   local tArray, tList, tTotal
   set the itemdelimiter to ",,"
   repeat for each item tWord in pText
      add 1 to tArray[tWord] -- add 1 to the word count of each 4-word item
   end repeat
   repeat for each key tKey in tArray
      if length(tKey) < 8 then next repeat
      put tKey & comma & comma & tArray[tKey] & cr after tList -- put each unique 4-word item and its frequency into tList
   end repeat
   sort tList descending numeric by item -1 of each
   -- the itemdelimiter is local to this handler, so it needs no reset before returning
   return line 1 to -1 of tList
end listFrequencies
Code:
on mouseUp
   put the millisecs into m1 --> start timing
   lock screen; lock messages
   set cursor to watch
   #-- start prepare INPUT [fld "Input" is at about 65 KByte]
   put fld "INPUT" into myInput -- or any other variable
   put 0 into m ## --> each increase of m by one doubles the input <-- ##
   repeat m --> 2^m copies for testing
      put cr & myInput after myInput
   end repeat
   #-- end prepare INPUT
   put the label of btn "phraseLength" into k0
   if k0 = 1 then -- unique truewords
      repeat for each trueword w in myInput
         add 1 to countPairs[w]
      end repeat
      put the number of truewords of myInput into j
   else
      #-- =============== Essential part START: INPUT is variable myInput
      #-- create truewords array
      put 0 into j
      repeat for each trueword w in myInput
         add 1 to j; put w into myWords[j]
      end repeat
      #-- j is now the number of truewords in myInput
      #-- create phrases and count their frequency
      switch k0
         case 2 -- two word phrases
            repeat for each key k in myWords
               #-- Now don't write "exit repeat". The keys are usually not sorted!!
               if k = j then next repeat
               add 1 to countPairs[myWords[k] & space & myWords[k+1]]
            end repeat
            break
         case 3 -- three word phrases
            repeat for each key k in myWords
               if k >= j-1 then next repeat
               add 1 to countPairs[myWords[k] & space & myWords[k+1] \
                     & space & myWords[k+2]]
            end repeat
            break
         case 4 -- four word phrases
            repeat for each key k in myWords
               if k >= j-2 then next repeat
               add 1 to countPairs[myWords[k] & space & myWords[k+1] \
                     & space & myWords[k+2] & space & myWords[k+3]]
            end repeat
            break
      end switch
   end if
   #-- "combine countPairs by cr and comma" is a little bit faster
   #-- but we want the frequency as first item for better oversight.
   repeat for each key k in countPairs
      put cr & countPairs[k] & comma & k after myOutput
   end repeat
   delete char 1 of myOutput -- is cr
   #-- =============== Essential part END : OUTPUT is variable myOutput
   #-- start prepare OUTPUT
   sort myOutput -- co-sort by word pairs
   sort myOutput descending numeric by item 1 of each
   put the num of lines of myOutput into diffPairs
   put "Total " & k0 & "-word phrases: " & (j-k0+1) & cr & "Diff " & k0 & \
         "-word phrases: " & diffPairs & cr & myOutput into fld "OUTPUT"
   #-- end prepare OUTPUT
   unlock screen; unlock messages
   put (the millisecs - m1) & " ms" into fld "Timing" --> end timing
end mouseUp
### [-hh fecit, April 2016]
Code:
-- j is the maximal value of k
repeat for each key k in myWords
   if k = j then exit repeat #<-- WRONG
   add 1 to countPairs[myWords[k] & space & myWords[k+1]]
end repeat
Code:
-- j is the maximal value of k
repeat for each key k in myWords
   if k = j then next repeat #<-- CORRECT
   add 1 to countPairs[myWords[k] & space & myWords[k+1]]
end repeat
The difference between download+process vs. download first and process later is merely the time spent on disk I/O, likely just a couple of milliseconds at most.
theotherbassist wrote:
hh, really helpful. Thanks. Honestly, it's quick enough for my purposes with either version. I'm sort of waiting on the "official" release of 8 to switch over--but that reservation is more just out of habit than anything else. Most of the "wait" time in my stack so far comes from pulling the contents of individual nodes from RSS feed XML data anyway--titles, descriptions, urls, etc. Do you think that it would be faster to save all the XML as local files and *then* read in the node content? What I'm doing now is pulling individual XML node contents directly from the web locations. Maybe making everything local in one fell swoop would cut some ms from the repeat functions that access the nodes.
That sounds like a fun project. You might consider writing a guest blog post for livecode.com with that when you've got it running. Drop a note to heather AT livecode.com - they love good development stories like that.
theotherbassist wrote:
So it's as easy as 1-2-3 then. Hah. I appreciate the detailed response--I feel a bit more confident knowing it's *kind of* a matter of taste and not necessarily better one way vs. the other when parsing limited nodes.
I'm really focusing on titles, and specifically headline news titles. I spend far too much time every day rifling through duplicate titles via Feedly/Nextgen Reader thanks to overlap and widespread use of AP by the press. I'm trying to build an app that runs an algorithm that dismantles, say, 30 headline RSS feeds across the globe into common 1-, 2-, 3-, and 4-word phrases and then rebuilds totally new titles based on word order within common phrases and the average frequency. All article titles on a given subject will be broken down and recombined into one "average title" ultimately presented back to the user. Sometimes the title will end up being one that was already in use, sometimes it'll be slightly contorted--but the point is that unless there's radically different information on the subject that necessitates two separate subject listings, the gist will be contained in the singular new "average" title. At the very least, it reduces overlap.
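The phrase-counting step described above can be sketched as one function that handles any phrase length. This is only a minimal sketch: the handler name countPhrases and its parameters pText and pLength are hypothetical and do not appear in the thread. It uses the same trueword-array idea as hh's script, but walks the array by numeric index, so the order of the array keys never matters.

```
-- Hypothetical helper (not from the thread): counts every pLength-word
-- phrase in pText and returns an array keyed by phrase.
function countPhrases pText, pLength
   local tWords, tWord, tCount, tPhrase, tCounts
   -- index every trueword once, as hh's script does
   put 0 into tCount
   repeat for each trueword tWord in pText
      add 1 to tCount
      put tWord into tWords[tCount]
   end repeat
   -- slide a window of pLength words over the indexed words;
   -- the loop does not run if the text has fewer than pLength words
   repeat with i = 1 to tCount - pLength + 1
      put tWords[i] into tPhrase
      repeat with k = 1 to pLength - 1
         put space & tWords[i + k] after tPhrase
      end repeat
      add 1 to tCounts[tPhrase]
   end repeat
   return tCounts
end countPhrases
```

For example, countPhrases(fld "INPUT", 2) would return the two-word phrase counts as an array. Building the sorted output list from it works the same way as in hh's script.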
Amen, brother. I'm a bit of a news junky myself. I made the LiveNet aggregator as part of a larger exploration I hope to get time to get back to soon, with a similar focus on weeding out replicated sources and finding the truly good stuff I'm most interested in.
Yeah, you can look at Google/Facebook trends for this sort of thing. But you can't tailor their list of sources to your liking. I don't want the average of everything on the web--I want the average of my own go-to news outlets.
Ah, if only some of those aggregators even knew what CDATA is. I've seen CDATA tags around plain text, XML, HTML, escaped HTML, and every strange mix you can imagine. I don't know how other parser writers handle it. Ah, the many times I've cursed at the feed generators I've had to work with.
p.s. I always remove CDATA references from the XML--at least when it comes to news, it seems like the better info is regularly nested in there.
For the record, I was making a big mistake without knowing it.
Most of the "wait" time in my stack so far comes from pulling the contents of individual nodes from RSS feed XML data anyway--titles, descriptions, urls, etc. Do you think that it would be faster to save all the XML as local files and *then* read in the node content? What I'm doing now is pulling individual xml node contents directly from the web locations. Maybe making everything local in one fell swoop would cut some ms from the repeat functions that access the nodes.