Not using quotes for words

Posted: Fri Oct 17, 2014 12:56 pm
by andrewferguson
Hi,

I am trying to use the "word" keyword, but I have run into a problem. I discovered that

Code:

answer the number of words of (quote & "hello world")
would actually answer 1, and not 2 like I expected. I looked in the dictionary and realised that this is intentional. This was a problem for me as I needed quotes to be ignored while processing "words".

I then tried changing my code to use "the number of items of", and setting the itemDelimiter to " ". However this presented another problem as

Code:

answer the number of items of "LiveCode         Forums"
answers 10, and not 2 like I need. Additionally, I have run into problems when the itemDelimiter is a space and two words are separated not by a space but by a line break.

So, I was wondering if there was a way to "disable" the quotes part of the "word" keyword, or maybe have more than one itemDelimiter at the same time?

Or is there another way I can do this?

Andrew

Re: Not using quotes for words

Posted: Fri Oct 17, 2014 1:18 pm
by Thierry
andrewferguson wrote:I was wondering if there was a way to "disable" the quotes part of the "word" keyword, or maybe have more than one itemDelimiter at the same time?
Or is there another way I can do this?
Andrew,

Without the exact context, it's a bit hard to know the *best* thing to do for you,
but what about copying your original text and doing a set of replaces to drop all your disturbing chars?
E.g.:

Code:

replace quote with space in yourText
replace return with space in yourText
..
HTH,

Thierry

Re: Not using quotes for words

Posted: Fri Oct 17, 2014 3:19 pm
by FourthWorld
Although it's still in testing it may be helpful to note that v7.0 introduces the new "trueWord" chunk type, which uses Unicode rules for determining words not only independent of white space but also of punctuation.
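
For illustration, a minimal sketch of how the two chunk types could be compared side by side. It assumes a LiveCode 7.0 or later engine, where "trueWord" is available; the variable name tText is only for this example.

```livecode
on mouseUp
   local tText
   put quote & "hello world" into tText
   -- "word" treats the quote as the start of a quoted group
   answer the number of words of tText
   -- "trueWord" applies Unicode word-breaking rules instead (v7.0+ only)
   answer the number of trueWords of tText
end mouseUp
```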

Re: Not using quotes for words

Posted: Fri Oct 17, 2014 3:52 pm
by [-hh]
Hi Andrew and Thierry,

I think what Andrew wants is not a replacement but a correct counting (yes, Andrew?)

Here: the number of chunks that are delimited by whitespace (the regex "\s"),
where the number of such chunks = 1 + the number of occurrences of contiguous whitespace.

I would try to walk through the start string s0 (for each char) and add 1 to a counter N for
every char that is not a whitespace char. Then return length(s0)-N+1.

What do you think about this? Not this easy, perhaps use of regex gives a better solution?
Hermann
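
A rough sketch of the walk-through idea, written as a variant that counts runs of non-whitespace characters directly, so a contiguous stretch of whitespace separates chunks only once and leading or trailing whitespace adds nothing. The function name CountWhitespaceChunks is hypothetical, and treating only space, tab and return as whitespace is an assumption (narrower than the regex "\s").

```livecode
function CountWhitespaceChunks pText
   local tCount, tInChunk
   put 0 into tCount
   put false into tInChunk
   repeat for each char tChar in pText
      if tChar is in (space & tab & return) then
         -- whitespace ends the current chunk, if there is one
         put false into tInChunk
      else
         -- first non-whitespace char after whitespace starts a new chunk
         if not tInChunk then add 1 to tCount
         put true into tInChunk
      end if
   end repeat
   return tCount
end CountWhitespaceChunks
```

Quotes get no special treatment here; they are counted like any other non-whitespace character.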

@Craig: Think about extending your 'itemdelimiter' feature request to enable: set itemdelimiter to whitespace?
Then for example a contiguous mix of 4 spaces, 3 tabs and 42 newlines would count as one delimiter.


@FourthWorld: Saw your post late, after editing mine. Wouldn't it be better to give scripters control over the set of delimiting chars, say a "word-breaking" set, than to adjust across several new versions after learning what is currently in (or not in) the 'word-separator' set?

Re: Not using quotes for words

Posted: Fri Oct 17, 2014 3:59 pm
by FourthWorld
Another option would be to just remove the quotes when counting, e.g.:

Code:

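on mouseUp
   -- show the quote-insensitive word count of field 1 in the message box
   put BetterWordCount(fld 1)
end mouseUp

function BetterWordCount s
   -- turn quotes into spaces so they no longer group text into one "word"
   replace quote with space in s
   return the number of words of s
end BetterWordCount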

Re: Not using quotes for words

Posted: Fri Oct 17, 2014 4:02 pm
by Thierry
[-hh] wrote:Hi Andrew and Thierry,

I think what Andrew wants is not a replacement but a correct counting (yes, Andrew?)
Well, I think so, and that is why I suggested erasing all the terrorist chars so he can count true words after that :)
[-hh] wrote:What do you think about this? Not this easy, perhaps use of regex gives a better solution?
Sure, regex will be helpful here.

Regards,

Thierry
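
One way the regex idea might look, as a sketch only: collapse every run of whitespace to a single space with replaceText, then count items, since items (unlike words) give quotes no special meaning. The function name WhitespaceChunkCount is hypothetical and not from the thread.

```livecode
function WhitespaceChunkCount pText
   -- strip leading and trailing whitespace so no empty item is produced
   put replaceText(pText, "^\s+|\s+$", empty) into pText
   -- collapse every run of spaces, tabs and line breaks to one space
   put replaceText(pText, "\s+", space) into pText
   -- count space-delimited items; quotes are ordinary characters here
   set the itemDelimiter to space
   return the number of items of pText
end WhitespaceChunkCount
```

This is meant to cover both of Andrew's cases, the leading quote and the run of spaces, as well as words separated by a line break, without changing the original text, because pText is a copy inside the function.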

Re: Not using quotes for words

Posted: Fri Oct 17, 2014 4:33 pm
by [-hh]
@FourthWorld: betterWordCount("A" &quote & "B")=2?

Re: Not using quotes for words

Posted: Fri Oct 17, 2014 4:39 pm
by andrewferguson
Hi everyone,

Thanks for the replies.
Initially I was against replacing the quotes, but when I thought about it again I realised that using the replace command to replace all the double quotes (") with single quotes (') would allow me to go back to using "the number of words of", and it wouldn't affect the text too much. (For various reasons I cannot replace all the single quotes back to double quotes at the end.)

Andrew

Re: Not using quotes for words

Posted: Fri Oct 17, 2014 7:55 pm
by FourthWorld
[-hh] wrote:@FourthWorld: betterWordCount("A" &quote & "B")=2?
Should two words separated by a quote be considered one word?

Re: Not using quotes for words

Posted: Fri Oct 17, 2014 8:34 pm
by [-hh]
Independent of what I'm thinking about current word separators:
Yes, the string "A"&quote&"B" *is* one word. Following the dictionary for the word definition
Docs wrote:... or if enclosed by quotes.
The emphasis here is on "enclosed". This means for me:
Everything that is between *a pair* of quotes is one word. Inside the quotes, on a next "level", there may again be several words.

"Incorrect" answers for the number of words are due not to the definition of "word" but to the definition of "number of" (as we already know from items): when a string starts with a single unpaired double quote, the engine puts a closing quote after the string to avoid an open group at level 1 (it's looking for 'pairs' of quotes).
The logic for that is as good or as bad (depending on your point of view) as the current definition of "the number of items".

Re: Not using quotes for words

Posted: Fri Oct 17, 2014 9:41 pm
by FourthWorld
[-hh] wrote:Independent of what I'm thinking about current word separators:
Yes, the string "A"&quote&"B" *is* one word. Following the dictionary for the word definition.
True, but Andrew's request was for something different from the Dictionary's definition of the "word" chunk type, something closer to natural language.

Even then, the simple function I provided won't account for everything. For example, "This.And.That" would still be counted as a single word.

Most common indexing methods strip all punctuation and other special characters. That could be done in script, but if that level of effort is needed it may be useful to consider getting started with v7 to take advantage of the Unicode support for such things.
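
As a rough illustration of the "strip punctuation in script" route for engines before v7, the sketch below replaces punctuation with spaces before counting words. The function name IndexWordCount is hypothetical, and it assumes the POSIX class [[:punct:]] in replaceText covers only ASCII punctuation, so curly quotes and other Unicode punctuation would need separate handling.

```livecode
function IndexWordCount pText
   -- replace ASCII punctuation, including quotes and full stops, with spaces
   put replaceText(pText, "[[:punct:]]", space) into pText
   -- no quotes remain, so "word" now splits on whitespace alone
   return the number of words of pText
end IndexWordCount
```

The trade-off is that apostrophes and hyphens are punctuation too, so "don't" or "well-known" would each be counted as two words with this approach.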