Get Number of Words Within Quotation Marks


dunbarx
VIP Livecode Opensource Backer
Posts: 10305
Joined: Wed May 06, 2009 2:28 pm

Re: Get Number of Words Within Quotation Marks

Post by dunbarx » Fri Oct 28, 2016 11:23 pm

AxWald.

Yep, that is real old fashioned coding. But what if there are three contiguous spaces?

:wink:

Craig

AxWald
Posts: 578
Joined: Thu Mar 06, 2014 2:57 pm

Re: Get Number of Words Within Quotation Marks

Post by AxWald » Sat Oct 29, 2016 11:03 am

Hi.
dunbarx wrote:But what if there are three contiguous spaces?
Just another loop ;-)

Code: Select all

repeat until offset("  ", MyStr) = 0
When this is done, all double spaces have been slaughtered ;-)
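The line quoted above is only the loop header. A minimal sketch of how the complete loop might look, assuming the body simply replaces each pair of spaces with a single one (the function name `collapseSpaces` is hypothetical, not from AxWald's stack):

```livecode
-- Hypothetical helper: collapses any run of spaces into a single space.
function collapseSpaces pText
   -- Each pass halves the runs of spaces (4 -> 2 -> 1, 3 -> 2 -> 1),
   -- so the loop ends once no pair of adjacent spaces is left.
   -- "space & space" avoids pasting invisible look-alike characters.
   repeat until offset(space & space, pText) = 0
      replace space & space with space in pText
   end repeat
   return pText
end collapseSpaces
```

Because the test is repeated after every pass, the same loop handles three, four, or any larger number of contiguous spaces.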

Have fun!
All code published by me here was created with Community Editions of LC (thus is GPLv3).
If you use it in closed source projects, or for the Apple AppStore, or with XCode
you'll violate some license terms - read your relevant EULAs & Licenses!

dunbarx
VIP Livecode Opensource Backer
Posts: 10305
Joined: Wed May 06, 2009 2:28 pm

Re: Get Number of Words Within Quotation Marks

Post by dunbarx » Sat Oct 29, 2016 1:58 pm

Excellent!

But what if there are four spaces?

Craig

richmond62
Livecode Opensource Backer
Posts: 10077
Joined: Fri Feb 19, 2010 10:17 am

Re: Get Number of Words Within Quotation Marks

Post by richmond62 » Sat Oct 29, 2016 2:54 pm

As I understand it, after partitioning the text into words the OP wants the quotes back where they were before. For example,
"Here I am!" should translate to the three 'parts': <"Here> and <I> and <am!">.
Well: I'd be inclined to run through the text and replace all the quotes with some other symbol that isn't used anywhere
else but doesn't interfere like quotes do . . .

Do the "other stuff"

And then replace the other symbols with quotes again.

AxWald
Posts: 578
Joined: Thu Mar 06, 2014 2:57 pm

Re: Get Number of Words Within Quotation Marks

Post by AxWald » Sat Oct 29, 2016 3:19 pm

Craig,
dunbarx wrote:But what if there are four spaces?
This is hardcore code. It runs until there are no more consecutive spaces, or until all CPUs go up in flames. Whatever comes first! 8)

Richmond,
richmond62 wrote:I'd be inclined to run through the text and replace all the quotes with some other symbol that isn't used anywhere [...]
From the demo stack above:

Code: Select all

   put ",.;:-?*1234567890" & quote into KillIt  -- characters that are not wanted
   repeat for each char theChar in MyStr
      if theChar is not in KillIt then put theChar after MyVar
      if theChar is quote then put "-" after MyVar  -- "-" stands in for the quote
   end repeat
richmond62 wrote:And then replace the other symbols with quotes again.

Code: Select all

   replace "-" with quote in MyList
:D

Have fun!

richmond62
Livecode Opensource Backer
Posts: 10077
Joined: Fri Feb 19, 2010 10:17 am

Re: Get Number of Words Within Quotation Marks

Post by richmond62 » Sat Oct 29, 2016 3:43 pm

I did [have fun . . . well, of a moderate sort]
[Attachments: tFlasher.png (screenshot), text flasher.livecode.zip (stack)]
Mind you, if you want to start flashing up individual words, as becomes obvious
from my stack, you'll need to strip out punctuation marks as well.

dunbarx
VIP Livecode Opensource Backer
Posts: 10305
Joined: Wed May 06, 2009 2:28 pm

Re: Get Number of Words Within Quotation Marks

Post by dunbarx » Sun Oct 30, 2016 6:17 pm

@AxWald.

I thought there was something wrong with your handler. If I just paste it into LC, it does not work. If I delete what appear to be spaces, and actually type spaces, it does.

Craig

dunbarx
VIP Livecode Opensource Backer
Posts: 10305
Joined: Wed May 06, 2009 2:28 pm

Re: Get Number of Words Within Quotation Marks

Post by dunbarx » Sun Oct 30, 2016 6:20 pm

Ah.

It is ASCII 202, not space.

Craig

AxWald
Posts: 578
Joined: Thu Mar 06, 2014 2:57 pm

Re: Get Number of Words Within Quotation Marks

Post by AxWald » Mon Oct 31, 2016 10:06 am

Hi.
dunbarx wrote:It is ASCII 202, not space.

Code: Select all

repeat until offset("  ", MyStr) = 0
from this post?

Strange. If I copy it from the browser, it's ASCII 32 as intended.

Have fun!

dunbarx
VIP Livecode Opensource Backer
Posts: 10305
Joined: Wed May 06, 2009 2:28 pm

Re: Get Number of Words Within Quotation Marks

Post by dunbarx » Mon Oct 31, 2016 2:40 pm

Hmmm.

That is really weird. I again copied the handler, and changed it slightly, taking care NOT to modify the "text" within the quotes, but only to delete one char. I made

Code: Select all

function killSpaces MyStr
   return charToNum(" ")
end killSpaces
The value between the quotes is what you posted, minus one char. I get 202.

I do not know if this is a LC issue or a browser issue. I am on a Mac.
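A likely explanation, offered as an assumption rather than a diagnosis: on a Mac, LiveCode's native encoding is MacRoman, in which character 202 is the non-breaking space (U+00A0), a character that browsers often substitute for ordinary spaces in rendered code. A minimal sketch of a cleanup step follows; the function name `normalizeSpaces` is hypothetical, and `numToCodepoint` requires LiveCode 7 or later:

```livecode
-- Hypothetical helper: turns non-breaking spaces into ordinary spaces.
function normalizeSpaces pText
   -- 160 is the Unicode codepoint U+00A0 (NO-BREAK SPACE).
   -- Using the codepoint avoids depending on the platform's native
   -- encoding (202 in MacRoman, 160 in Latin-1).
   replace numToCodepoint(160) with space in pText
   return pText
end normalizeSpaces
```

Running pasted script text or downloaded files through a step like this before searching for double spaces would make the result independent of where the text was copied from.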

Craig

tonymac
Posts: 23
Joined: Thu Jan 05, 2012 9:17 pm

Re: Get Number of Words Within Quotation Marks

Post by tonymac » Thu Nov 17, 2016 2:14 pm

Sorry for this long post but I felt I needed to clarify what I'm trying to do as a lot of the responses were going astray.

This project is for MOBILE PHONES only.

1. Download simple/plain text file.
2. Get an EXACT count of the number of words in this text file.
3. Show the text of this file one word at a time in the same field.

... (from sample text below) put "Once" in field "TheText", replace "Once" with the word "upon", replace the word "upon" with the word "a", etc. The delay between replacing the words is in milliseconds, probably averaging 200-300 milliseconds. The short explanation is that the words are shown and replaced in a repeat loop that uses the total number of words as its upper limit. If the word count is off, the repeat loop will end up not showing all of the text. For this sample of 100 words, depending on which method you use to count words, the count is off by anywhere from one to half a dozen words. While that might not seem like much, some of the text files might be over 70,000 words, which means the word count could be off by hundreds for such a file.

Hope to be able...
... to use plain text files to keep downloading time quick for MOBILE PHONES.
... to NOT have to massage the text moving/replacing characters/words after it's downloaded.
... to NOT have to use Unicode, as that again means further time-consuming massaging of the text.

The quick solution would be for LC to recognize words within quotation marks as separate words, but that's not how the "word" keyword works. Words are defined in most if not all English text editors as characters separated by one or more spaces. I have tried many variations of counting words, items, true words, tokens, etc. to try to get an exact word count, but none of the methods (even some of the ones posted here) have been exact.

Below is a small sample of the text. If you feel challenged, try a method you have in mind that will be able to give a count of 100 words. Remember that individual words are groups of characters separated by one or more spaces. If you try counting words by getting the number of items and there is more than one space between words, the additional spaces are counted as words, which gives an inaccurate word count (remember the part that I don't want to have to massage the text after the file is downloaded).

(HERE IS SOME TEXT. IT'S EXACTLY 100 WORDS LONG. COPY/PASTE IT INTO A TEXT EDITOR AND VERIFY THE WORD COUNT)
Once upon a time there were four little Rabbits, and their names were-- Flopsy, Mopsy, Cotton-tail, and Peter. They lived with their Mother in a sand-bank, underneath the root of a very big fir-tree.

"Now, my dears," said old Mrs. Rabbit one morning, "you may go into the fields or down the lane, but don't go into Mr. McGregor's garden: your Father had an accident there; he was put in a pie by Mrs. McGregor."

"Now run along, and don't get into mischief." Then Mrs. Rabbit took a basket and her umbrella, and went through the woods to the baker's.

(END OF TEXT)

I am still continuing to try to figure this out as it's critical to the project. Thanks for any input.
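One possible approach to the requirements above, sketched under assumptions: since LiveCode's "word" chunk treats a quoted phrase as a single word, the quotes are swapped for a placeholder in a copy of the text (the idea richmond62 suggested earlier), so that words are split on whitespace only. The handler names are hypothetical, the placeholder `numToChar(1)` is assumed never to occur in the files, and field "TheText" is taken from the post.

```livecode
-- Hypothetical helper: counts whitespace-separated words,
-- ignoring the special meaning of quotes.
function countPlainWords pText
   -- pText is passed by value, so the downloaded text itself is untouched.
   replace quote with numToChar(1) in pText
   return the number of words in pText
end countPlainWords

-- Hypothetical handler: shows the text one word at a time.
on flashWords pText, pDelayMs
   local tPlaceholder, tWord, tShown
   put numToChar(1) into tPlaceholder
   replace quote with tPlaceholder in pText
   -- "repeat for each" visits every word, so no separate count is needed.
   repeat for each word tWord in pText
      put tWord into tShown
      replace tPlaceholder with quote in tShown  -- restore the quote for display
      put tShown into field "TheText"
      wait pDelayMs milliseconds with messages
   end repeat
end flashWords
```

A call such as `flashWords tDownloadedText, 250` would then step through the text; because the loop iterates over the words directly, an inaccurate count can no longer cut the display short.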

Thierry
VIP Livecode Opensource Backer
Posts: 875
Joined: Wed Nov 22, 2006 3:42 pm

Re: Get Number of Words Within Quotation Marks

Post by Thierry » Thu Nov 17, 2016 2:55 pm

tonymac wrote: (HERE IS SOME TEXT. IT'S EXACTLY 100 WORDS LONG. COPY/PASTE IT INTO A TEXT EDITOR AND VERIFY THE WORD COUNT)
Thanks for any input.
Following your request,
I copied and pasted it into BBEdit, and here are the character/word/line counters:
[Screenshot: sunnY 2016-11-17 à 14.47.22.png]
Well, 103 here! :shock:

Regards,

Thierry
!
SUNNY-TDZ.COM doesn't belong to me since 2021.
To contact me, use the Private messages. Merci.
!

FourthWorld
VIP Livecode Opensource Backer
Posts: 10043
Joined: Sat Apr 08, 2006 7:05 am

Re: Get Number of Words Within Quotation Marks

Post by FourthWorld » Thu Nov 17, 2016 3:15 pm

With that sample text:

LiveCode words: 60
LiveCode trueWords: 103
gEdit text editor: 107
Geany text editor: 100
LibreOffice Writer: 100
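The two LiveCode figures in the list above come from different chunk types. A minimal sketch of how they might be obtained side by side (the function name `compareWordCounts` is hypothetical; `trueWord` requires LiveCode 7 or later):

```livecode
-- Hypothetical helper: reports both LiveCode word counts for a text.
function compareWordCounts pText
   local tReport
   -- "word": runs of non-whitespace; a quoted phrase counts as one word.
   put "words:" && (the number of words in pText) & cr into tReport
   -- "trueWord": split according to Unicode word-break rules.
   put "trueWords:" && (the number of trueWords in pText) after tReport
   return tReport
end compareWordCounts
```

Called as `put compareWordCounts(fld 1)`, it returns two lines, one per chunk type.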

Curious about the differences, I wrote this script:

Code: Select all

on mouseUp
   put fld 1 into s
   put 0 into i
   -- number each trueWord and put it on its own line
   repeat for each trueWord w in s
      add 1 to i
      put i && w & cr after t
   end repeat
   put t into fld 2
end mouseUp
...which yields:

Code: Select all

1 Once
2 upon
3 a
4 time
5 there
6 were
7 four
8 little
9 Rabbits
10 and
11 their
12 names
13 were
14 Flopsy
15 Mopsy
16 Cotton
17 tail
18 and
19 Peter
20 They
21 lived
22 with
23 their
24 Mother
25 in
26 a
27 sand
28 bank
29 underneath
30 the
31 root
32 of
33 a
34 very
35 big
36 fir
37 tree
38 Now
39 my
40 dears
41 said
42 old
43 Mrs
44 Rabbit
45 one
46 morning
47 you
48 may
49 go
50 into
51 the
52 fields
53 or
54 down
55 the
56 lane
57 but
58 don't
59 go
60 into
61 Mr
62 McGregor's
63 garden
64 your
65 Father
66 had
67 an
68 accident
69 there
70 he
71 was
72 put
73 in
74 a
75 pie
76 by
77 Mrs
78 McGregor
79 Now
80 run
81 along
82 and
83 don't
84 get
85 into
86 mischief
87 Then
88 Mrs
89 Rabbit
90 took
91 a
92 basket
93 and
94 her
95 umbrella
96 and
97 went
98 through
99 the
100 woods
101 to
102 the
103 baker's
Which of those are not words?

At first glance it appears that the language researchers at IBM did a reasonably good job of creating the Unicode parsing libraries LiveCode uses with trueWord.
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn

shaosean
Posts: 906
Joined: Thu Nov 04, 2010 7:53 am

Re: Get Number of Words Within Quotation Marks

Post by shaosean » Thu Nov 17, 2016 3:45 pm

Code: Select all

on mouseUp
   local output
   repeat with i = 1 to the number of words in the text of field "log"
      if (char 1 of word i of the text of field "log" = QUOTE) then
         local tempWord
         put word i of the text of field "log" into tempWord
         delete char 1 of tempWord
         delete char -1 of tempWord
         repeat with j = 1 to the number of words in tempWord
            put word j of tempWord & LF after output
         end repeat
      else
         put word i of the text of field "log" & LF after output
      end if
   end repeat
   put the number of words of output
end mouseUp

FourthWorld
VIP Livecode Opensource Backer
Posts: 10043
Joined: Sat Apr 08, 2006 7:05 am

Re: Get Number of Words Within Quotation Marks

Post by FourthWorld » Thu Nov 17, 2016 6:11 pm

Nicely done, shaosean.

I converted both your handler and the one using trueWord above into functions which each return numbered word lists:

Code: Select all

on mouseUp
   put fld "log" into s
   put TrueWordList(s) into fld "out1"
   put ShaoSeanList(s) into fld "out2"
end mouseUp

function TrueWordList s
   put 0 into i
   repeat for each trueWord w in s
      add 1 to i
      put i && w & cr after t
   end repeat
   return t
end TrueWordList

function ShaoSeanList s
   repeat with i = 1 to the number of words in s
      if (char 1 of word i of s = QUOTE) then
         local tempWord
         put word i of s into tempWord
         delete char 1 of tempWord
         delete char -1 of tempWord
         repeat with j = 1 to the number of words in tempWord
            put word j of tempWord & LF after output
         end repeat
      else
         put word i of s & LF after output
      end if
   end repeat
   --
   put 0 into i
   repeat for each word w in output
      add 1 to i
      put i && w & cr after tOutList
   end repeat
   return tOutList
end ShaoSeanList
When done "out1" contains:

Code: Select all

1 Once
2 upon
3 a
4 time
5 there
6 were
7 four
8 little
9 Rabbits
10 and
11 their
12 names
13 were
14 Flopsy
15 Mopsy
16 Cotton
17 tail
18 and
19 Peter
20 They
21 lived
22 with
23 their
24 Mother
25 in
26 a
27 sand
28 bank
29 underneath
30 the
31 root
32 of
33 a
34 very
35 big
36 fir
37 tree
38 Now
39 my
40 dears
41 said
42 old
43 Mrs
44 Rabbit
45 one
46 morning
47 you
48 may
49 go
50 into
51 the
52 fields
53 or
54 down
55 the
56 lane
57 but
58 don't
59 go
60 into
61 Mr
62 McGregor's
63 garden
64 your
65 Father
66 had
67 an
68 accident
69 there
70 he
71 was
72 put
73 in
74 a
75 pie
76 by
77 Mrs
78 McGregor
79 Now
80 run
81 along
82 and
83 don't
84 get
85 into
86 mischief
87 Then
88 Mrs
89 Rabbit
90 took
91 a
92 basket
93 and
94 her
95 umbrella
96 and
97 went
98 through
99 the
100 woods
101 to
102 the
103 baker's
...and "out2" contains:

Code: Select all

1 Once
2 upon
3 a
4 time
5 there
6 were
7 four
8 little
9 Rabbits,
10 and
11 their
12 names
13 were--
14 Flopsy,
15 Mopsy,
16 Cotton-tail,
17 and
18 Peter.
19 They
20 lived
21 with
22 their
23 Mother
24 in
25 a
26 sand-bank,
27 underneath
28 the
29 root
30 of
31 a
32 very
33 big
34 fir-tree.
35 Now,
36 my
37 dears,
38 said
39 old
40 Mrs.
41 Rabbit
42 one
43 morning,
44 you
45 may
46 go
47 into
48 the
49 fields
50 or
51 down
52 the
53 lane,
54 but
55 don't
56 go
57 into
58 Mr.
59 McGregor's
60 garden:
61 your
62 Father
63 had
64 an
65 accident
66 there;
67 he
68 was
69 put
70 in
71 a
72 pie
73 by
74 Mrs.
75 McGregor.
76 Now
77 run
78 along,
79 and
80 don't
81 get
82 into
83 mischief.
84 Then
85 Mrs.
86 Rabbit
87 took
88 a
89 basket
90 and
91 her
92 umbrella,
93 and
94 went
95 through
96 the
97 woods
98 to
99 the
100 baker's.
When we examine both we find the three strings each algo treats differently are:

Cotton-tail (lines 16 and 17 in "out1", line 16 in "out2")
sand-bank (lines 27 and 28 in "out1", line 26 in "out2")
fir-tree (lines 36 and 37 in "out1", line 34 in "out2")

Whether the hyphenated form "Cotton-tail" is appropriate for a proper noun is subjective; IMO good arguments could be made either way. The rabbit species is commonly written as "cottontail", but proper nouns are often culture-dependent and as such are usually allowed to follow whatever idiomatic rules the named party prefers. It's worth noting, however, that as a title we most commonly see "Peter Cottontail", with no hyphen.

Based on some quick web searches, it would seem the string "sand-bank" is uncommon in modern English usage, more commonly written as "sand bank" or "sandbank".

Similarly, while "fir-tree" is used more commonly than "sand-bank", even more common is simply "fir tree".

In both of the latter two cases, hyphen or not I think we'd all agree we're looking at two words. The exception is only the proper noun, and even that differs from how it appears in the story's own title.

All three of these anomalies appear to be unique to the stylized nature of that children's story, and perhaps idioms common to the US East Coast circa 1900.

The advantage of your algo is that it produces a word count more in keeping with an apparent plurality of algos in common use.

The advantages of trueWord are coding simplicity, runtime efficiency, and the ability to easily identify words devoid of punctuation, while still being consistent with word counts we see in at least some other software like the one Thierry noted above.

Which algo is "best" will of course depend on what it's expected to do. Early on there was some suggestion that being able to identify words (presumably without adjacent punctuation) was a requirement, but it's unclear to me if that's truly needed or if the only requirement is a word count.

Another factor may be the nature of the data being examined. Unless working exclusively with US English texts, trueWord will likely produce better results across a wider range of data, and even with US English will deliver counts generally on par with modern English usage (and, proper nouns aside, arguably more accurate than counting hyphenated expressions as a single word).
