Nicely done, shaosean.
I converted both your handler and the one using trueWord above into functions which each return numbered word lists:
Code:
on mouseUp
   put fld "log" into s
   put TrueWordList(s) into fld "out1"
   put ShaoSeanList(s) into fld "out2"
end mouseUp

-- Returns a numbered list of the trueWords in s, one per line
function TrueWordList s
   put 0 into i
   repeat for each trueWord w in s
      add 1 to i
      put i && w & cr after t
   end repeat
   return t
end TrueWordList

-- Returns a numbered list of the whitespace-delimited words in s, one per line
function ShaoSeanList s
   repeat with i = 1 to the number of words in s
      if (char 1 of word i of s = QUOTE) then
         -- a quoted string counts as one "word", so strip the quotes and split it
         local tempWord
         put word i of s into tempWord
         delete char 1 of tempWord
         delete char -1 of tempWord
         repeat with j = 1 to the number of words in tempWord
            put word j of tempWord & LF after output
         end repeat
      else
         put word i of s & LF after output
      end if
   end repeat
   -- number the collected words
   put 0 into i
   repeat for each word w in output
      add 1 to i
      put i && w & cr after tOutList
   end repeat
   return tOutList
end ShaoSeanList
When done, "out1" contains:
Code:
1 Once
2 upon
3 a
4 time
5 there
6 were
7 four
8 little
9 Rabbits
10 and
11 their
12 names
13 were
14 Flopsy
15 Mopsy
16 Cotton
17 tail
18 and
19 Peter
20 They
21 lived
22 with
23 their
24 Mother
25 in
26 a
27 sand
28 bank
29 underneath
30 the
31 root
32 of
33 a
34 very
35 big
36 fir
37 tree
38 Now
39 my
40 dears
41 said
42 old
43 Mrs
44 Rabbit
45 one
46 morning
47 you
48 may
49 go
50 into
51 the
52 fields
53 or
54 down
55 the
56 lane
57 but
58 don't
59 go
60 into
61 Mr
62 McGregor's
63 garden
64 your
65 Father
66 had
67 an
68 accident
69 there
70 he
71 was
72 put
73 in
74 a
75 pie
76 by
77 Mrs
78 McGregor
79 Now
80 run
81 along
82 and
83 don't
84 get
85 into
86 mischief
87 Then
88 Mrs
89 Rabbit
90 took
91 a
92 basket
93 and
94 her
95 umbrella
96 and
97 went
98 through
99 the
100 woods
101 to
102 the
103 baker's
...and "out2" contains:
Code:
1 Once
2 upon
3 a
4 time
5 there
6 were
7 four
8 little
9 Rabbits,
10 and
11 their
12 names
13 were--
14 Flopsy,
15 Mopsy,
16 Cotton-tail,
17 and
18 Peter.
19 They
20 lived
21 with
22 their
23 Mother
24 in
25 a
26 sand-bank,
27 underneath
28 the
29 root
30 of
31 a
32 very
33 big
34 fir-tree.
35 Now,
36 my
37 dears,
38 said
39 old
40 Mrs.
41 Rabbit
42 one
43 morning,
44 you
45 may
46 go
47 into
48 the
49 fields
50 or
51 down
52 the
53 lane,
54 but
55 don't
56 go
57 into
58 Mr.
59 McGregor's
60 garden:
61 your
62 Father
63 had
64 an
65 accident
66 there;
67 he
68 was
69 put
70 in
71 a
72 pie
73 by
74 Mrs.
75 McGregor.
76 Now
77 run
78 along,
79 and
80 don't
81 get
82 into
83 mischief.
84 Then
85 Mrs.
86 Rabbit
87 took
88 a
89 basket
90 and
91 her
92 umbrella,
93 and
94 went
95 through
96 the
97 woods
98 to
99 the
100 baker's.
Comparing the two, we find three strings that the algos treat differently:
Cotton-tail (lines 16 and 17 in "out1", line 16 in "out2")
sand-bank (lines 27 and 28 in "out1", line 26 in "out2")
fir-tree (lines 36 and 37 in "out1", line 34 in "out2")
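For anyone who wants to pull those entries out without reading both lists by eye, here is a minimal sketch. The handler name InternalHyphens is made up for illustration, and it assumes the input is a numbered list in the same "number word" form that ShaoSeanList returns, with only ASCII letters around the hyphen:

```livecode
-- InternalHyphens is a hypothetical helper name used only for illustration.
-- pList is expected to be a numbered list like the one ShaoSeanList returns.
function InternalHyphens pList
   local tHits
   repeat for each line tLine in pList
      -- match a hyphen with a letter on both sides, so "were--" is skipped
      if matchText(word 2 of tLine, "[A-Za-z]-[A-Za-z]") then
         put tLine & cr after tHits
      end if
   end repeat
   return tHits
end InternalHyphens
```

Called as InternalHyphens(fld "out2"), this should leave just the three hyphenated entries listed above. Texts with accented letters would need a broader pattern.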
Whether the hyphenated form "Cotton-tail" is stylistically appropriate for a proper noun is subjective. IMO good arguments could be made either way: the rabbit species is commonly written as "cottontail", but proper nouns are often culture-dependent and are usually allowed to follow whatever idiom the named party prefers. It's worth noting, however, that as a title we most commonly see "Peter Cottontail", with no hyphen.
Based on some quick web searches, it would seem the string "sand-bank" is uncommon in modern English usage, more commonly written as "sand bank" or "sandbank".
Similarly, while "fir-tree" is used more commonly than "sand-bank", even more common is simply "fir tree".
In both of the latter two cases, hyphen or not, I think we'd all agree we're looking at two words. The only exception is the proper noun, and even that differs from how it appears in the story's own title.
All three of these anomalies appear to be unique to the stylized nature of that children's story, and perhaps reflect idioms common in the author's time and place (circa 1900).
The advantage of your algo is that it produces a word count more in keeping with an apparent plurality of algos in common use.
The advantages of trueWord are coding simplicity, runtime efficiency, and the ability to easily identify words devoid of punctuation, while still being consistent with word counts we see in at least some other software like the one Thierry noted above.
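If a count is all that's needed, the simplicity point can be seen in a minimal sketch like the one below. The handler name CountReport is made up for illustration, and it assumes a LiveCode version that supports the trueWord chunk (7.0 or later, as far as I know):

```livecode
-- CountReport is a hypothetical helper name, not part of either handler above.
function CountReport s
   local tReport
   -- trueWord: word breaks follow Unicode rules, so punctuation is not counted
   put "trueWords:" && the number of trueWords in s & cr into tReport
   -- word: whitespace-delimited, with a quoted string treated as a single word
   put "words:" && the number of words in s & cr after tReport
   return tReport
end CountReport
```

Note that the plain word count here will come out lower than ShaoSeanList's whenever the text contains quoted passages, since that handler splits those back apart before counting.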
Which algo is "best" will of course depend on what it's expected to do. Early on there was some suggestion that being able to identify words (presumably without adjacent punctuation) was a requirement, but it's unclear to me if that's truly needed or if the only requirement is a word count.
Another factor may be the nature of the data being examined. Unless working exclusively with US English texts, trueWord will likely produce better results across a wider range of data, and even with US English it will deliver counts generally on par with modern English usage (and, proper nouns aside, arguably more accurate ones than counting hyphenated expressions as a single word).