
Screen scraping or interpreting data from a website

Posted: Sun Aug 21, 2016 6:10 pm
by Jordy
Hi,

Is there a way to access the text on a website so that I can analyze and work with it? For instance, I'm trying to figure out what information a web form is asking for.

I'm not sure what capabilities LiveCode already has. Worst case, I was considering getting the source code of the website and parsing it to interpret the site myself.


THANKS

Re: Screen scraping or interpreting data from a website

Posted: Mon Aug 22, 2016 2:44 pm
by Mikey
I typically use other tools to extract the scrape and then LC to analyze it, but you can use "put url" to get the data from a URL.
I haven't tried it yet because I only noticed this last night, but the source for the browser widget is available in 8, right in the application bundle, so my long-delayed dream of using LC to scrape directly might be closer once I see what the source is doing...
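For what it's worth, the bare-bones version of the "put url" approach looks something like this (the address is just a placeholder):

Code: Select all

-- minimal sketch of fetching a page's source with put url
put url "http://www.example.com" into tPageSource
if the result is not empty then
   answer "Download failed:" && the result -- libURL reports errors in the result
else
   -- tPageSource now holds the raw HTML, ready to be parsed
end if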

Re: Screen scraping or interpreting data from a website

Posted: Mon Aug 22, 2016 5:03 pm
by FourthWorld
What tools do you use for scraping, Mikey?

Re: Screen scraping or interpreting data from a website

Posted: Mon Aug 22, 2016 5:16 pm
by Mikey
There are many, many tools. As I mentioned, I have used "put url" in LC, but if I'm doing a big scrape (think hundreds of thousands of records), the one I like best is a Chrome plugin called "Web Scraper", from Martins Balodis. It takes a little fiddling, but once you get it set up it works great, even when scraping huge sites, and it lets you set the delay between pages so that you don't annoy the operator by saturating their bandwidth.

Martins has both a free and a paid version: the paid version runs from one of his servers, the free version from your machine. When you're done you end up with a CSV file, and you can have multiple scrapes going in different tabs at the same time. Note that if you are trying to do a big scrape on a single URL, breaking it into sections can be tricky, but if not, it's smooth sailing.
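Once you have the CSV, pulling it back into LC for the analysis step is straightforward. Here's a rough sketch; the file path and column handling are my own assumptions, not anything the plugin dictates:

Code: Select all

-- sketch: read the exported CSV back into LiveCode for analysis
put url ("file:" & specialFolderPath("desktop") & "/scrape.csv") into tCSV
set the itemDelimiter to comma
delete line 1 of tCSV -- drop the header row
repeat for each line tRecord in tCSV
   -- naive item-based split; fields with embedded commas need real CSV parsing
   put item 1 of tRecord & return after tFirstColumnValues
end repeat
put the number of lines of tFirstColumnValues && "records loaded"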

I have also paid him to write a custom scrape configuration in a case where things were more complicated than I could figure out. The price was cheap, I thought, and after I saw his setup I gained new insight into how to use the tool, so now I'm generally able to write my own configurations without much difficulty, even for the most complex sites.

Re: Screen scraping or interpreting data from a website

Posted: Mon Aug 22, 2016 6:20 pm
by jacque
This will give you the plain text:

Code: Select all

put url tURL into tData -- fetch the raw HTML source
set the htmltext of the templateField to tData -- let the field engine interpret the tags
put the text of the templateField into tPlainText -- the text property gives the tag-stripped text
Now you can parse the plain text to see what's there.
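If you want to look at the raw tags instead, before stripping them, a rough sketch along the same lines (assuming, optimistically, that each input tag sits on its own line in the source):

Code: Select all

put url tURL into tData
put tData into tInputLines
-- keep only lines that mention an input tag; crude, since real pages
-- may put several tags on one line or spread one tag across lines
filter tInputLines with "*<input*"
put tInputLines -- a quick look at the fields the form is asking for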

Re: Screen scraping or interpreting data from a website

Posted: Mon Aug 22, 2016 7:27 pm
by Mikey
By the way, when I say the source for the browser widget, I mean the LCB source, not C++ or Objective-C, for those who are wondering. The reason NOT to use the "put url" technique is for cases where there is a JS framework that has to be executed in the browser as well. In many of those cases the data is not included when you retrieve the page source; the meat, i.e. the data, has to be pulled separately by the browser. So you can either read through the JS to figure out how to write code that fetches what you want, or you can use a browser to get the net result and pull the data from that.
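As a rough sketch of that last approach, assuming a browser widget named "Browser" on the card and that I'm remembering the widget's "do ... in widget" syntax right (the JS expression is just one way to grab the rendered text):

Code: Select all

-- point the widget at the page; its JS runs as the page loads
on scrapeRendered pURL
   set the url of widget "Browser" to pURL
end scrapeRendered

-- once loading finishes, pull the text as the browser rendered it
on browserDocumentLoadComplete pURL
   do "document.body.innerText" in widget "Browser"
   put the result into tRenderedText
   -- tRenderedText now includes anything the page's JS generated after load
end browserDocumentLoadComplete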

Re: Screen scraping or interpreting data from a website

Posted: Mon Aug 22, 2016 7:33 pm
by FourthWorld
Mikey wrote:The reason NOT to use the "put url" technique is for cases where there is a JS framework that has to be executed in the browser as well. In many of those cases the data is not included when you retrieve the page source; the meat, i.e. the data, has to be pulled separately by the browser. So you can either read through the JS to figure out how to write code that fetches what you want, or you can use a browser to get the net result and pull the data from that.
Given the growing need for JS blockers like NoScript, the prudent business owner should consider content requiring JS to be a bug.