Not sure if you managed to solve this, but I'm facing the same problem and I think I've found a solution. I'm hoping you (or anybody) can help validate it.
It seems that revBrowser and its ability to get the source HTML are spotty at best.
I found that sometimes it worked and sometimes it didn't, and subsequent calls to the same instance return different answers!
So instead, since I don't need to actually RENDER the HTML, I'm using libURLDownloadToFile to retrieve the HTML directly from the server to a local file, which I then open, read and process.
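For anyone following along, here is a rough sketch of what that download step could look like. The handler names (fetchPageSource, pageDownloaded) and the "page.html" temp file are made up for illustration, and I'm assuming the callback receives the URL and a status string, so please check the dictionary entry for libURLDownloadToFile before relying on it:

```livecode
-- Hypothetical handler names and temp file name, for illustration only
on fetchPageSource pURL
   put specialFolderPath("temporary") & "/page.html" into tFilePath
   -- Asynchronous: "pageDownloaded" is sent to this object when the transfer ends
   libURLDownloadToFile pURL, tFilePath, "pageDownloaded"
end fetchPageSource

on pageDownloaded pURL, pStatus
   put specialFolderPath("temporary") & "/page.html" into tFilePath
   -- Assumption: a failed transfer reports "error" or "timeout", or leaves no file behind
   if (pStatus is "error") or (pStatus is "timeout") or (there is no file tFilePath) then
      answer "Could not download" && pURL
   else
      -- Read the saved HTML back in for processing
      put URL ("file:" & tFilePath) into field "htmlSource"
   end if
end pageDownloaded
```

Because the download is asynchronous, the processing has to start from the callback rather than on the line after the libURLDownloadToFile call.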
In my process, I split the HTML by ">", then, for each piece that holds an anchor tag ("<a"), I split it by spaces into its attributes and split the href attribute by quotes to pull out its value:
Code:
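put field "htmlSource" into theElements
put empty into theURLS
split theElements by ">"
repeat for each element thisElement in theElements
   -- Text may precede the tag, so locate "<a " instead of testing char 1 to 2.
   -- Comparisons are case-insensitive by default, so this also matches "<A ".
   put offset("<a ", thisElement) into tagStart
   if tagStart > 0 then
      put char tagStart to -1 of thisElement into theTag
      split theTag by space
      repeat for each element thisAttribute in theTag
         if thisAttribute begins with "href=" then
            -- Element 2 is the text between the first pair of double quotes
            split thisAttribute by quote
            put thisAttribute[2] & return after theURLS
         end if
      end repeat
   end if
end repeat
put theURLS into field "listURLS"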
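To make the quote-splitting step concrete, here is a tiny example on a single made-up attribute (the "/News" value is just sample data). It also shows the limits: an href written with single quotes, or one containing spaces, would not come out cleanly this way.

```livecode
-- Sample attribute text: href="/News"
put "href=" & quote & "/News" & quote into thisAttribute
split thisAttribute by quote
-- Element 1 is "href=", element 2 is the value between the quotes
put thisAttribute[2] into theValue
answer theValue
```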
Of course, many of the HREFs are relative, like "/News" and "/Comments", etc.
So I'm adding the root URL in front to build out the full URL:
Code:
on getMyURLs
   put field "tbURL" into rootURL
   -- Remove any trailing slash
   if rootURL ends with "/" then
      delete the last char of rootURL
   end if
   -- Work on variables; reading and writing fields inside the loop is slow
   put field "listURLS" into sourceURLs
   put field "listMyCollectedURLS" into collectedURLs
   repeat for each line thisURL in sourceURLs
      put empty into newURL
      if thisURL begins with rootURL then
         -- Already a full URL on this site
         put thisURL into newURL
      else if thisURL begins with "./" then
         -- Relative to the current directory: drop the leading "."
         put rootURL & char 2 to -1 of thisURL into newURL
      else if (thisURL begins with "/") and not (thisURL begins with "//") then
         -- Root-relative; "//host/path" is protocol-relative, so skip it
         put rootURL & thisURL into newURL
      end if
      -- If we've not been there, grab it
      if newURL is not empty and newURL is not among the lines of collectedURLs then
         put newURL & return after collectedURLs
      end if
   end repeat
   put collectedURLs into field "listMyCollectedURLS"
end getMyURLs
So far this seems to be working. I'm now working on automating the process and then validating it against more web sites.
Of course, this won't help if links are generated by anything other than pure HTML (JavaScript, for example), so some sites will be scraped incompletely, but I doubt it's possible to handle ALL of those cases with any technology...
I'm VERY new to LiveCode, so I hope this helps, and if anybody has suggestions for improvements, I'm more than interested in hearing them.
Thanks!
...Jeff