wordOffset problem

Posted: Sun Feb 20, 2011 7:24 am
by Barkandgargle
I'm trying to grab the URL of the "official website" listed under "External Links" on any given Wikipedia page. I fetch the HTML source of the page and then try to use wordOffset to find the URL string. The URL I want is always preceded by the code <li><span class="official website"> , so I search for that string first.

The problem I'm having: because there is no space between the word official and the quotation mark before it, wordOffset does not see official as a separate word. OK, so then I search for the word official including all the characters before it, which means I'm searching for <li><span class="official . But wordOffset always returns zero, indicating it could not find anything, when gosh darn it I can plainly see that it is there.

Because what I'm looking for contains a quotation mark, I can't type it directly inside a quoted string. So I build the search string piece by piece and put it into a variable, for example: put "<li><span class=" & quote & "official" into xHolder. I've checked by putting the string into the message window to make sure it is as it should be, and it is. But again, wordOffset returns zero. Argggg.

Any and all advice on this greatly appreciated.

Re: wordOffset problem

Posted: Sun Feb 20, 2011 7:29 am
by Barkandgargle
Maybe a shorter way of asking this: how can I locate words or phrases within HTML code when they're not necessarily separated by spaces? wordOffset won't identify a word unless it is separated by spaces. If I can figure out how to do this, I won't need to muck around solving what I described in the post above.

Thanks.

Re: wordOffset problem

Posted: Sun Feb 20, 2011 7:36 am
by bangkok
Barkandgargle wrote: But wordoffset always returns zero, indicating it could not find anything.
No. ;-)

The Dictionary is your (best) friend. From the wordOffset entry:

"If the wordToFind is more than one word, the wordOffset function always returns zero, even if the wordToFind appears in the stringToSearch."

Re: wordOffset problem

Posted: Sun Feb 20, 2011 7:38 am
by Barkandgargle
Hi bangkok. Thanks for replying. However, what I'm searching for is not more than one word. The fact that it is a contiguous set of characters means that LiveCode is seeing it as one word.

Here's something else:

If I use put "<li><span class=" & quote & "official " is in field htmlsource it returns true - indicating the string or "word" does exist and can be found in the code.

But, as I said, when I use put the wordOffset ("<li><span class=" & quote & "official ", field "htmlsource") it returns zero, indicating it cannot be found. And I need to know not only that it exists, but what word number it is.

??

Re: wordOffset problem

Posted: Sun Feb 20, 2011 7:49 am
by bangkok
Ah okay. So perhaps you should do a replace first:

replace "<li>" with space in myHTMLSource

That would allow wordOffset to find the "word" (because it would then have one space before it and one space after).

Re: wordOffset problem

Posted: Sun Feb 20, 2011 12:16 pm
by BvG
Words are delimited by space, tab or return. In addition, everything enclosed by quotes is a word.

You are searching for a string that contains a quote, and LC can't see that as a word. Use the normal offset function to do that.
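
A rough sketch of the offset approach, not tested against a live Wikipedia page. The handler name officialSiteURL and its variables are made up for illustration, and it assumes the URL sits in the first href="..." attribute after the marker, which the thread does not confirm.

```livecode
-- Hypothetical helper: returns the URL that follows the "official website"
-- marker in pSource, or empty if the marker or the URL cannot be found.
function officialSiteURL pSource
   local tMarker, tHref, tMarkerPos, tHrefPos, tURLStart, tQuotePos

   -- Build the strings with the quote constant, as in the original post
   put "<li><span class=" & quote & "official website" & quote & ">" into tMarker
   put "href=" & quote into tHref

   -- offset() is a plain substring search, so spaces and quotes inside the
   -- marker do not matter; it returns a character position, or 0 if not found
   put offset(tMarker, pSource) into tMarkerPos
   if tMarkerPos is 0 then return empty

   -- The third parameter is the number of chars to skip; the value returned
   -- is counted from the end of the skipped part
   put offset(tHref, pSource, tMarkerPos) into tHrefPos
   if tHrefPos is 0 then return empty

   -- Absolute position of the first char of the URL (assumed layout)
   put tMarkerPos + tHrefPos + the length of tHref into tURLStart

   -- Find the closing quote of the href attribute
   put offset(quote, pSource, tURLStart - 1) into tQuotePos
   if tQuotePos is 0 then return empty

   return char tURLStart to (tURLStart + tQuotePos - 2) of pSource
end officialSiteURL
```

Usage would be something like put officialSiteURL(field "htmlsource"). If a word number is still needed, the number of words in char 1 to tMarkerPos of pSource should give roughly that, though how quoted strings earlier in the page are counted may affect the result.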

Re: wordOffset problem

Posted: Mon Feb 28, 2011 11:54 pm
by uabclst
Have you tried using the matchText function to get the URL?
Something like get matchText(theText,"official website..(.*)xxx",theURL), where xxx is some characters that come after the URL.
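
A minimal sketch of how that could look with a more specific pattern. The handler name officialSiteURLByRegex is hypothetical, and the pattern assumes the link tag comes immediately after the span and carries the URL in an href="..." attribute; adjust it to whatever the page source really contains.

```livecode
-- Hypothetical helper: returns the captured URL, or empty if nothing matches.
function officialSiteURLByRegex pSource
   local tPattern, tURL

   -- Pattern, with q standing for a double quote:
   --   class=qofficial websiteq><a[^>]*href=q([^q]*)q
   -- The parenthesised group captures everything up to the closing quote.
   put "class=" & quote & "official website" & quote & "><a[^>]*href=" & quote into tPattern
   put "([^" & quote & "]*)" & quote after tPattern

   -- matchText returns true on a match and fills tURL with the first group
   if matchText(pSource, tPattern, tURL) then
      return tURL
   end if
   return empty
end officialSiteURLByRegex
```

Capturing up to the next quote with [^"]* instead of (.*) keeps the match from running on to a later part of the page.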