Re: Get Number of Words Within Quotation Marks

Posted: Fri Oct 28, 2016 11:23 pm
by dunbarx
AxWald.

Yep, that is real old fashioned coding. But what if there are three contiguous spaces?

:wink:

Craig

Re: Get Number of Words Within Quotation Marks

Posted: Sat Oct 29, 2016 11:03 am
by AxWald
Hi.
dunbarx wrote: But what if there are three contiguous spaces?
Just another loop ;-)

Code: Select all

repeat until offset("  ", MyStr) = 0
When this is done, all double spaces have been slaughtered ;-)

Have fun!

Re: Get Number of Words Within Quotation Marks

Posted: Sat Oct 29, 2016 1:58 pm
by dunbarx
Excellent!

But what if there are four spaces?

Craig

Re: Get Number of Words Within Quotation Marks

Posted: Sat Oct 29, 2016 2:54 pm
by richmond62
As I understand it, the OP wants the quotes put back where they were after the text has been split into words. For example,
"Here I am!" should translate to the three 'parts': <"Here> and <I> and <am!">.
Well: I'd be inclined to run through the text and replace all the quotes with some other symbol that isn't used anywhere
else but doesn't interfere like quotes do . . .

Do the "other stuff"

And then replace the other symbols with quotes again.

Re: Get Number of Words Within Quotation Marks

Posted: Sat Oct 29, 2016 3:19 pm
by AxWald
Craig,
dunbarx wrote: But what if there are four spaces?
This is hardcore code. It runs until there are no more consecutive spaces, or until all CPUs go up in flames. Whichever comes first! 8)

Richmond,
richmond62 wrote: I'd be inclined to run through the text and replace all the quotes with some other symbol that isn't used anywhere [...]
From the demo stack above:

Code: Select all

  put ",.;:-?*1234567890" & quote into KillIt  -- this is not wanted.
   repeat for each char TheChar in MyStr
      if theChar is not in KillIt then put theChar after MyVar
      if theChar is quote then put "-" after MyVar
   end repeat
richmond62 wrote: And then replace the other symbols with quotes again.

Code: Select all

   replace "-" with quote in MyList
:D

Have fun!

Re: Get Number of Words Within Quotation Marks

Posted: Sat Oct 29, 2016 3:43 pm
by richmond62
I did [have fun . . . well, of a moderate sort]
[Image: tFlasher.png]
[Attachment: text flasher.livecode.zip (stack, 1.55 KiB)]
Mind you, if you want to start flashing up individual words, as becomes obvious
from my stack, you'll need to strip out punctuation marks as well.

Re: Get Number of Words Within Quotation Marks

Posted: Sun Oct 30, 2016 6:17 pm
by dunbarx
@AxWald.

I thought there was something wrong with your handler. If I just paste it into LC, it does not work. If I delete what appear to be spaces, and actually type spaces, it does.

Craig

Re: Get Number of Words Within Quotation Marks

Posted: Sun Oct 30, 2016 6:20 pm
by dunbarx
Ah.

It is ASCII 202, not space.

Craig

Re: Get Number of Words Within Quotation Marks

Posted: Mon Oct 31, 2016 10:06 am
by AxWald
Hi.
dunbarx wrote: It is ASCII 202, not space.

Code: Select all

repeat until offset("  ", MyStr) = 0
from this post?

Strange. If I copy it from the browser, it's ASCII 32 as intended.

Have fun!

Re: Get Number of Words Within Quotation Marks

Posted: Mon Oct 31, 2016 2:40 pm
by dunbarx
Hmmm.

That is really weird. I again copied the handler, and changed it slightly, taking care NOT to modify the "text" within the quotes, but only to delete one char. I made

Code: Select all

function killSpaces MyStr
   return charToNum(" ")
end killSpaces
The value between the quotes is what you posted, minus one char. I get 202.

I do not know if this is a LC issue or a browser issue. I am on a Mac.

Craig
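
For what it's worth, character 202 in the Mac's native (MacRoman) encoding is the non-breaking space, which would explain why the pasted handler looks right but never matches a real double space. Where it crept in (forum, browser, or clipboard) is not clear from the thread. A minimal sketch of a workaround, assuming LiveCode 7 or later (for numToCodepoint) and using a made-up function name:

```livecode
-- Hypothetical helper: turns non-breaking spaces into ordinary spaces.
function plainSpaces pText
   -- U+00A0 (decimal 160) is the no-break space; it looks like a
   -- space on screen but does not compare equal to one.
   replace numToCodepoint(160) with space in pText
   return pText
end plainSpaces
```

Running pasted text through something like this before collapsing double spaces should sidestep the problem, though retyping the spaces by hand, as described above, works just as well for a one-off.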

Re: Get Number of Words Within Quotation Marks

Posted: Thu Nov 17, 2016 2:14 pm
by tonymac
Sorry for this long post but I felt I needed to clarify what I'm trying to do as a lot of the responses were going astray.

This project is for MOBILE PHONES only.

1. Download simple/plain text file.
2. Get an EXACT count of the number of words in this text file.
3. Show the text of this file one word at a time in the same field.

... (from sample text below) put "Once" in field "TheText", replace "Once" with the word "upon", replace the word "upon" with the word "a", etc. The delay between replacing the words is in milliseconds, probably averaging 200-300 milliseconds. In short, the words are shown/replaced in a repeat loop that uses the total number of words as its upper bound. If the word count is off, the repeat loop will end up not showing all of the text. For this sample of 100 words, the count is off by anywhere from 1 to half a dozen words, depending on which method you use to count. While that might not seem like much, some of the text files might be over 70,000 words, which means the word count could be off by hundreds for such a file.

Hope to be able...
... to use plain text files to keep downloading time quick for MOBILE PHONES.
... to NOT have to massage the text moving/replacing characters/words after it's downloaded.
... to NOT have to use Unicode, as that again means further time-consuming massaging of the text.

The quick solution would be for LC to recognize words within quotation marks as separate words, but that's not how the "word" keyword works. Words are defined in most if not all English text editors as characters separated by one or more spaces. I have tried many variations of counting words, items, true words, tokens, etc. to try to get an exact word count, but none of the methods (even some of the ones posted here) have been exact.

Below is a small sample of the text. If you feel challenged, try a method you have in mind that will be able to give a count of 100 words. Remember that individual words are groups of characters separated by one or more spaces. If you try counting words by getting the number of items and there is more than one space between words, the additional spaces are counted as words which give an inaccurate word count (remember the part that I don't want to have to massage the text after the file is downloaded).

(HERE IS SOME TEXT. IT'S EXACTLY 100 WORDS LONG. COPY/PASTE IT INTO A TEXT EDITOR AND VERIFY THE WORD COUNT)
Once upon a time there were four little Rabbits, and their names were-- Flopsy, Mopsy, Cotton-tail, and Peter. They lived with their Mother in a sand-bank, underneath the root of a very big fir-tree.

"Now, my dears," said old Mrs. Rabbit one morning, "you may go into the fields or down the lane, but don't go into Mr. McGregor's garden: your Father had an accident there; he was put in a pie by Mrs. McGregor."

"Now run along, and don't get into mischief." Then Mrs. Rabbit took a basket and her umbrella, and went through the woods to the baker's.

(END OF TEXT)

I am still continuing to try to figure this out as it's critical to the project. Thanks for any input.
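
One way to read the requirement above (words are runs of characters separated by whitespace, and quotes get no special treatment) can be sketched as follows. This is only a sketch under assumptions: spaceWordCount and flashWords are made-up names, field "TheText" is taken from the post, and numToCodepoint assumes LiveCode 7 or later. The text is passed by value, so the downloaded text itself is not changed.

```livecode
-- Hypothetical helper: counts whitespace-delimited words.
function spaceWordCount pText
   -- With the quotes removed, the "word" chunk no longer treats a
   -- quoted phrase as one word. pText is a copy of the caller's text.
   replace quote with empty in pText
   return the number of words of pText
end spaceWordCount

-- Hypothetical handler: shows pText one word at a time in field "TheText".
command flashWords pText, pDelay
   local tMarker, tWord, tShown
   -- Stand-in for the quote character; assumed not to occur in the text.
   put numToCodepoint(1) into tMarker
   replace quote with tMarker in pText
   repeat for each word tWord in pText
      put tWord into tShown
      replace tMarker with quote in tShown  -- restore quotes for display
      put tShown into field "TheText"
      wait pDelay milliseconds with messages
   end repeat
end flashWords
```

A call such as flashWords tDownloadedText, 250 (variable name hypothetical, delay picked from the posted range) would then step through the text. Because "repeat for each" walks the words directly, the display loop does not depend on a separately computed count, so a miscount cannot cut the text short.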

Re: Get Number of Words Within Quotation Marks

Posted: Thu Nov 17, 2016 2:55 pm
by Thierry
tonymac wrote: (HERE IS SOME TEXT. IT'S EXACTLY 100 WORDS LONG. COPY/PASTE IT INTO A TEXT EDITOR AND VERIFY THE WORD COUNT)
Thanks for any input.
Following your request,
I copied and pasted it into BBEdit, and here are the character/word/line counters:
[Image: sunnY 2016-11-17 à 14.47.22.png]
Well, 103 here! :shock:

Regards,

Thierry

Re: Get Number of Words Within Quotation Marks

Posted: Thu Nov 17, 2016 3:15 pm
by FourthWorld
With that sample text:

LiveCode words: 60
LiveCode trueWords: 103
gEdit text editor: 107
Geany text editor: 100
LibreOffice Writer: 100
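
The first two figures can presumably be reproduced with the chunk counts themselves. A minimal sketch, assuming the sample text sits in field 1 and LiveCode 7 or later (for trueWords):

```livecode
on mouseUp
   local tText
   put fld 1 into tText
   -- "word" splits on whitespace but keeps a quoted phrase together;
   -- "trueWord" follows Unicode word-breaking rules.
   put "words:" && the number of words of tText & cr & \
         "trueWords:" && the number of trueWords of tText
end mouseUp
```

With no destination given, put writes the result to the message box.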

Curious about the differences, I wrote this script:

Code: Select all

on mouseUp
   put fld 1 into s
   put 0 into i
   repeat for each trueWord w in s
      add 1 to i
      put i && w &cr after t 
   end repeat
   put t into fld 2
end mouseUp
...which yields:

Code: Select all

1 Once
2 upon
3 a
4 time
5 there
6 were
7 four
8 little
9 Rabbits
10 and
11 their
12 names
13 were
14 Flopsy
15 Mopsy
16 Cotton
17 tail
18 and
19 Peter
20 They
21 lived
22 with
23 their
24 Mother
25 in
26 a
27 sand
28 bank
29 underneath
30 the
31 root
32 of
33 a
34 very
35 big
36 fir
37 tree
38 Now
39 my
40 dears
41 said
42 old
43 Mrs
44 Rabbit
45 one
46 morning
47 you
48 may
49 go
50 into
51 the
52 fields
53 or
54 down
55 the
56 lane
57 but
58 don't
59 go
60 into
61 Mr
62 McGregor's
63 garden
64 your
65 Father
66 had
67 an
68 accident
69 there
70 he
71 was
72 put
73 in
74 a
75 pie
76 by
77 Mrs
78 McGregor
79 Now
80 run
81 along
82 and
83 don't
84 get
85 into
86 mischief
87 Then
88 Mrs
89 Rabbit
90 took
91 a
92 basket
93 and
94 her
95 umbrella
96 and
97 went
98 through
99 the
100 woods
101 to
102 the
103 baker's
Which of those are not words?

At first glance it appears that the language researchers at IBM did a reasonably good job of creating the Unicode parsing libraries LiveCode uses with trueWord.

Re: Get Number of Words Within Quotation Marks

Posted: Thu Nov 17, 2016 3:45 pm
by shaosean

Code: Select all

on mouseUp
   local output
   repeat with i = 1 to the number of words in the text of field "log"
      if (char 1 of word i of the text of field "log" = QUOTE) then
         local tempWord
         put word i of the text of field "log" into tempWord
         delete char 1 of tempWord
         delete char -1 of tempWord
         repeat with j = 1 to the number of words in tempWord
            put word j of tempWord & LF after output
         end repeat
      else
         put word i of the text of field "log" & LF after output
      end if
   end repeat
   put the number of words of output
end mouseUp

Re: Get Number of Words Within Quotation Marks

Posted: Thu Nov 17, 2016 6:11 pm
by FourthWorld
Nicely done, shaosean.

I converted both your handler and the one using trueWord above into functions which each return numbered word lists:

Code: Select all

on mouseUp
   put fld "log" into s
   put TrueWordList(s) into fld "out1"
   put ShaoSeanList(s) into fld "out2"
end mouseUp

function TrueWordList s
   put 0 into i
   repeat for each trueWord w in s
      add 1 to i
      put i && w &cr after t 
   end repeat
   return t
end TrueWordList

function ShaoSeanList s
   repeat with i = 1 to the number of words in s
      if (char 1 of word i of s = QUOTE) then
         local tempWord
         put word i of s into tempWord
         delete char 1 of tempWord
         delete char -1 of tempWord
         repeat with j = 1 to the number of words in tempWord
            put word j of tempWord & LF after output
         end repeat
      else
         put word i of s & LF after output
      end if
   end repeat
   --
   put 0 into i
   repeat for each word w in output
      add 1 to i
      put i && w &cr after tOutList
   end repeat
   return tOutList
end ShaoSeanList
When done "out1" contains:

Code: Select all

1 Once
2 upon
3 a
4 time
5 there
6 were
7 four
8 little
9 Rabbits
10 and
11 their
12 names
13 were
14 Flopsy
15 Mopsy
16 Cotton
17 tail
18 and
19 Peter
20 They
21 lived
22 with
23 their
24 Mother
25 in
26 a
27 sand
28 bank
29 underneath
30 the
31 root
32 of
33 a
34 very
35 big
36 fir
37 tree
38 Now
39 my
40 dears
41 said
42 old
43 Mrs
44 Rabbit
45 one
46 morning
47 you
48 may
49 go
50 into
51 the
52 fields
53 or
54 down
55 the
56 lane
57 but
58 don't
59 go
60 into
61 Mr
62 McGregor's
63 garden
64 your
65 Father
66 had
67 an
68 accident
69 there
70 he
71 was
72 put
73 in
74 a
75 pie
76 by
77 Mrs
78 McGregor
79 Now
80 run
81 along
82 and
83 don't
84 get
85 into
86 mischief
87 Then
88 Mrs
89 Rabbit
90 took
91 a
92 basket
93 and
94 her
95 umbrella
96 and
97 went
98 through
99 the
100 woods
101 to
102 the
103 baker's
...and "out2" contains:

Code: Select all

1 Once
2 upon
3 a
4 time
5 there
6 were
7 four
8 little
9 Rabbits,
10 and
11 their
12 names
13 were--
14 Flopsy,
15 Mopsy,
16 Cotton-tail,
17 and
18 Peter.
19 They
20 lived
21 with
22 their
23 Mother
24 in
25 a
26 sand-bank,
27 underneath
28 the
29 root
30 of
31 a
32 very
33 big
34 fir-tree.
35 Now,
36 my
37 dears,
38 said
39 old
40 Mrs.
41 Rabbit
42 one
43 morning,
44 you
45 may
46 go
47 into
48 the
49 fields
50 or
51 down
52 the
53 lane,
54 but
55 don't
56 go
57 into
58 Mr.
59 McGregor's
60 garden:
61 your
62 Father
63 had
64 an
65 accident
66 there;
67 he
68 was
69 put
70 in
71 a
72 pie
73 by
74 Mrs.
75 McGregor.
76 Now
77 run
78 along,
79 and
80 don't
81 get
82 into
83 mischief.
84 Then
85 Mrs.
86 Rabbit
87 took
88 a
89 basket
90 and
91 her
92 umbrella,
93 and
94 went
95 through
96 the
97 woods
98 to
99 the
100 baker's.
When we examine both we find the three strings each algo treats differently are:

Cotton-tail (lines 16 and 17 in "out1", line 16 in "out2")
sand-bank (lines 27 and 28 in "out1", line 26 in "out2")
fir-tree (lines 36 and 37 in "out1", line 34 in "out2")
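
A quick way to find such strings in other texts is to walk the whitespace-delimited tokens and flag any that trueWord does not count as exactly one word. A rough sketch, with a made-up function name, assuming LiveCode 7 or later:

```livecode
-- Hypothetical helper: lists tokens that whitespace splitting and
-- trueWord count differently, one per line, as token <tab> trueWord count.
function tokensCountedDifferently pText
   local tToken, tCount, tReport
   -- Remove quotes so "word" splits on whitespace only (pText is a copy).
   replace quote with empty in pText
   repeat for each word tToken in pText
      put the number of trueWords of tToken into tCount
      if tCount is not 1 then
         put tToken & tab & tCount & cr after tReport
      end if
   end repeat
   return tReport
end tokensCountedDifferently
```

Calling put tokensCountedDifferently(fld "log") shows the report in the message box. Going by the two lists above, one would expect it to pick out the hyphenated strings for this sample, but that is an expectation, not something tested here.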

Whether the hyphenated form "Cotton-tail" is appropriate for a proper noun is subjective; IMO good arguments could be made either way, since the rabbit species is commonly written as "cottontail", but proper nouns are often culture-dependent and as such are usually allowed to follow whatever idiomatic rules the named party prefers. It's worth noting, however, that as a title we most commonly see "Peter Cottontail", with no hyphen.

Based on some quick web searches, it would seem the string "sand-bank" is uncommon in modern English usage, more commonly written as "sand bank" or "sandbank".

Similarly, while "fir-tree" is used more commonly than "sand-bank", even more common is simply "fir tree".

In both of the latter two cases, hyphen or not I think we'd all agree we're looking at two words. The exception is only the proper noun, and even that differs from how it appears in the story's own title.

All three of these anomalies appear to be unique to the stylized nature of that children's story, and perhaps idioms common to the US East Coast circa 1900.

The advantage of your algo is that it produces a word count more in keeping with an apparent plurality of algos in common use.

The advantages of trueWord are coding simplicity, runtime efficiency, and the ability to easily identify words devoid of punctuation, while still being consistent with word counts we see in at least some other software like the one Thierry noted above.

Which algo is "best" will of course depend on what it's expected to do. Early on there was some suggestion that being able to identify words (presumably without adjacent punctuation) was a requirement, but it's unclear to me if that's truly needed or if the only requirement is a word count.

Another factor may be the nature of the data being examined. Unless working exclusively with US English texts, trueWord will likely produce better results across a wider range of data, and even with US English will deliver counts generally on par with modern English usage (and proper nouns aside, arguably more accurate than counting hyphenated expressions as a single word).