Browser widget : data mining on new sites ?

bangkok
VIP Livecode Opensource Backer
Posts: 937
Joined: Fri Aug 15, 2008 7:15 am

Browser widget : data mining on new sites ?

Post by bangkok » Wed Feb 01, 2023 6:57 am

More and more websites use JavaScript-driven dynamic content, in order (I guess) to prevent data mining.

An example :
https://www.set.or.th/en/market/product ... /PTT/price

It's the new website of the Thai stock market.

It's no longer possible to analyse its content with LiveCode (using URL, htmlText or the browser widget)!

However, using the developer tools in Mozilla Firefox (not the "view source" tool), it's possible to manually find specific content (some tags, etc.).

My question is: is there a way, within LiveCode, to analyse the structure of what is displayed by the browser widget?

With the same level of detail as Mozilla Firefox?

SparkOut
Posts: 2834
Joined: Sun Sep 23, 2007 4:58 pm

Re: Browser widget : data mining on new sites ?

Post by SparkOut » Wed Feb 01, 2023 8:35 am

Every site with dynamic structure must be investigated individually to see what may or may not be possible.
In the case of Stock Exchange data, there is usually an API for non-human access. This might or might not be paywalled or subject to usage conditions.
In the case of the Thai market you link, I found a page that includes this API information: https://media.set.or.th/set/Documents/2 ... 0_FAQS.pdf
I didn't download it and I don't know whether the API even accesses the data you are after, but I think you should check what options for API connectivity you have.

Thomas seewald
Posts: 7
Joined: Mon Sep 26, 2016 12:06 pm

Re: Browser widget : data mining on new sites ?

Post by Thomas seewald » Wed Feb 01, 2023 6:17 pm

Hello, do you know how to use do ... in widget "Browser" to run a script?
You have to use JavaScript commands to get the text you want, because the text is loaded by JavaScript and is not in the initial HTML.
Getting the htmlText of widget "Browser" does not help.

bangkok
VIP Livecode Opensource Backer
Posts: 937
Joined: Fri Aug 15, 2008 7:15 am

Re: Browser widget : data mining on new sites ?

Post by bangkok » Thu Feb 02, 2023 2:53 am

Thomas seewald wrote:
Wed Feb 01, 2023 6:17 pm
Hello, do you know how to use do ... in widget "Browser" to run a script?
You have to use JavaScript commands to get the text you want, because the text is loaded by JavaScript and is not in the initial HTML.
Getting the htmlText of widget "Browser" does not help.
Ah no, I do not. It sounds interesting.

Do you have a code example? Thanks.

But in any case, since the widget executes everything (even JavaScript) and correctly displays all the data, why isn't it possible to easily recover or extract the data from the widget once the page is fully loaded?

I know it's possible to create/export an image of a page displayed by the widget... why not text?

Another weird thing: it's possible to manually press CTRL+A (select all) inside the browser widget, and then copy and paste the content (text)!

FourthWorld
VIP Livecode Opensource Backer
Posts: 9801
Joined: Sat Apr 08, 2006 7:05 am
Location: Los Angeles
Contact:

Re: Browser widget : data mining on new sites ?

Post by FourthWorld » Thu Feb 02, 2023 4:00 am

Bangkok, unless you are an expert with JavaScript you will not want to go this route. And even if you were, you would probably still not want to go this route. :)

This type of question comes up now and then, and nearly every time someone tries to solve a data retrieval problem by emulating user actions in a browser GUI rather than calling an API directly, the outcome ranges from disappointing to frustrating.

Here's some background:
https://forums.livecode.com/viewtopic.p ... 54#p218540

In short, there are just too many JS frameworks to be able to anticipate what you'll need to call and how you'll need to call it. And even if you jump through all those hoops, the moment the page format changes in any way, the scraper is dead and needs to be rewritten.

Find APIs. Use APIs. Love APIs.

There are APIs for the Thai stock market.
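
As a rough illustration of what calling an API directly from LiveCode can look like, here is a minimal sketch that fetches JSON over HTTPS and converts it to an array. The endpoint URL, the "symbol" parameter and the field "Output" are placeholders, not a real SET API; the actual address, authentication and response layout have to come from the provider's documentation. It also assumes JSONToArray is available through the mergJSON external.

```livecode
command fetchQuote pSymbol
   local tURL, tJSON, tData
   -- Placeholder endpoint: replace with the URL documented by the data provider
   put "https://api.example.com/quotes?symbol=" & urlEncode(pSymbol) into tURL
   put url tURL into tJSON
   -- After a failed "put url", the result holds the error description
   if the result is not empty then
      answer "Request failed:" && the result
      exit fetchQuote
   end if
   -- Convert the JSON text into a LiveCode array (mergJSON external assumed)
   put JSONToArray(tJSON) into tData
   -- List the top-level keys; the real structure depends on the API
   put the keys of tData into field "Output"
end fetchQuote
```

Called as fetchQuote "PTT", this only lists the top-level keys of the response, which is usually enough to see how the data is organised.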
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn

bangkok
VIP Livecode Opensource Backer
Posts: 937
Joined: Fri Aug 15, 2008 7:15 am

Re: Browser widget : data mining on new sites ?

Post by bangkok » Thu Feb 02, 2023 7:49 am

FourthWorld wrote:
Thu Feb 02, 2023 4:00 am
Bangkok, unless you are an expert with JavaScript you will not want to go this route. And even if you were, you would probably still not want to go this route. :)
Fair enough. ;-)

Perhaps the solution is to ask for an improvement of the browser widget ? A new property (text) if htmltext is not good enough ?

Yes, there are APIs (for stock prices), but I was interested in the news published by the listed companies.

FourthWorld
VIP Livecode Opensource Backer
Posts: 9801
Joined: Sat Apr 08, 2006 7:05 am
Location: Los Angeles
Contact:

Re: Browser widget : data mining on new sites ?

Post by FourthWorld » Thu Feb 02, 2023 10:21 am

bangkok wrote:
Thu Feb 02, 2023 7:49 am
FourthWorld wrote:
Thu Feb 02, 2023 4:00 am
Bangkok, unless you are an expert with JavaScript you will not want to go this route. And even if you were, you would probably still not want to go this route. :)
Fair enough. ;-)

Perhaps the solution is to ask for an improvement of the browser widget ? A new property (text) if htmltext is not good enough ?
It's not about any particular browser implementation. It's about the multiplicity of modern web frameworks.

Scraping is inherently problematic.

The links I provided offer more background.
bangkok wrote:
Yes, there are APIs (for stock prices), but I was interested in the news published by the listed companies.
Sites that want their content syndicated generally provide APIs or RSS feeds for that. APIs are not hard to use in LC, and RSS is breezy fun.

Sites that don't provide either APIs or RSS feeds generally don't want their content syndicated.

If the site owner doesn't explicitly prohibit syndication in their TOS, consider writing to them to suggest adding an RSS feed. With modern tooling it shouldn't be hard for nearly any site to add RSS.
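
To give an idea of how little code an RSS feed needs, here is a minimal sketch that lists the item titles of a feed using the revXML library. It assumes an RSS 2.0 layout (rss/channel/item/title); the function name feedTitles and whatever feed URL you pass in are illustrative, and a feed in another format (Atom, for example) would need different node paths.

```livecode
function feedTitles pFeedURL
   local tXML, tTree, tCount, tTitles, i
   put url pFeedURL into tXML
   -- A non-empty result after "put url" means the download failed
   if the result is not empty then return empty
   -- Build an in-memory XML tree (revXML library)
   put revXMLCreateTree(tXML, false, true, false) into tTree
   -- On a parse error the function returns a string starting with "xmlerr"
   if tTree begins with "xmlerr" then return empty
   -- Count the <item> elements directly under /rss/channel
   put revXMLNumberOfChildren(tTree, "/rss/channel", "item", 1) into tCount
   repeat with i = 1 to tCount
      put revXMLNodeContents(tTree, "/rss/channel/item[" & i & "]/title") & return after tTitles
   end repeat
   -- Free the tree once the titles have been collected
   revXMLDeleteTree tTree
   return tTitles
end feedTitles
```

The function returns one title per line, or empty if the feed cannot be downloaded or parsed.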

jacque
VIP Livecode Opensource Backer
Posts: 7210
Joined: Sat Apr 08, 2006 8:31 pm
Location: Minneapolis MN
Contact:

Re: Browser widget : data mining on new sites ?

Post by jacque » Thu Feb 02, 2023 7:51 pm

bangkok wrote:
Thu Feb 02, 2023 7:49 am
Perhaps the solution is to ask for an improvement of the browser widget ? A new property (text) if htmltext is not good enough ?
This converts html to text (pseudo code):

Code: Select all

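-- "Browser" is an assumed widget name; substitute your own
set the htmlText of the templateField to the htmlText of widget "Browser"
get the text of the templateField -- the plain text is now in "it"
reset the templateField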
It won't always be perfect because LC fields don't support many browser tags.
Jacqueline Landman Gay | jacque at hyperactivesw dot com
HyperActive Software | http://www.hyperactivesw.com

FourthWorld
VIP Livecode Opensource Backer
Posts: 9801
Joined: Sat Apr 08, 2006 7:05 am
Location: Los Angeles
Contact:

Re: Browser widget : data mining on new sites ?

Post by FourthWorld » Thu Feb 02, 2023 9:04 pm

jacque wrote:
Thu Feb 02, 2023 7:51 pm
bangkok wrote:
Thu Feb 02, 2023 7:49 am
Perhaps the solution is to ask for an improvement of the browser widget ? A new property (text) if htmltext is not good enough ?
This converts html to text (pseudo code):

Code: Select all

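-- "Browser" is an assumed widget name; substitute your own
set the htmlText of the templateField to the htmlText of widget "Browser"
get the text of the templateField -- the plain text is now in "it"
reset the templateField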
It won't always be perfect because LC fields don't support many browser tags.
A good solution but to a different problem.

The challenge Bangkok is facing is that the page contains little content, instead using JavaScript calls to pull in the content from the server on the fly during rendering, using APIs.

So downloading the page will get you the JavaScript function names and their params, but no content.

The trick in attempting scraping in a dynamically rendered page is to figure out the JS framework being used, and attempt to inject instructions from LC into the page to trigger the JS rendering.

Alternatively, one could make a request to the browser widget for the final rendered page instead of the page's actual source, which is likely what Bangkok had in mind when considering an enhancement request for the browser widget.

But even that wouldn't be a magic panacea. It would still be difficult to get working right, difficult to parse, and perpetually subject to change (even more likely than static pages may change, given that a page that looks the same may be driven by a framework that changed).
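
For anyone who wants to experiment anyway, here is a minimal sketch of asking the widget for the rendered page rather than its source. It assumes a browser widget named "Browser" and a field named "Output", and that do ... in widget leaves the value of the JavaScript expression in the result. The handler name readRenderedPage and the two-second delay are arbitrary choices for illustration, not a reliable way to know when a page has finished fetching its data.

```livecode
-- Card script; assumes a browser widget named "Browser" and a field "Output"
on browserDocumentLoadComplete pURL
   -- Scripts may still be fetching data at this point, so wait before reading.
   -- The 2-second delay is an arbitrary guess and will vary by site.
   send "readRenderedPage" to me in 2 seconds
end browserDocumentLoadComplete

command readRenderedPage
   local tHTML, tText
   -- Ask the page for its current DOM, including script-generated nodes
   do "document.documentElement.outerHTML" in widget "Browser"
   put the result into tHTML -- rendered markup, available for parsing
   -- Visible text only, roughly what select-all and copy would give
   do "document.body.innerText" in widget "Browser"
   put the result into tText
   put tText into field "Output"
end readRenderedPage
```

Everything said above still applies: what comes back depends entirely on the site's current markup, so any parsing built on it can break without notice.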

And with all due respect to those who keep looking for ways to avoid using APIs, the technical challenges are an indicator of site policy.

Sites that encourage syndication usually do so actively, providing APIs and/or RSS to explicitly and easily support that.

Sites that don't usually have a very different policy.

In all cases, it's useful to review the site's terms of service. Where the content owners may be open to considering syndication, they're often receptive to a gentle nudge to provide RSS to make that happen easily, and in a way that retains their governance over what is shared.

Bonus points that RSS makes a developer's job super easy.

bangkok
VIP Livecode Opensource Backer
Posts: 937
Joined: Fri Aug 15, 2008 7:15 am

Re: Browser widget : data mining on new sites ?

Post by bangkok » Mon Feb 06, 2023 3:21 am

jacque wrote:
Thu Feb 02, 2023 7:51 pm
This converts html to text (pseudo code):

Code: Select all

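-- "Browser" is an assumed widget name; substitute your own
set the htmlText of the templateField to the htmlText of widget "Browser"
get the text of the templateField -- the plain text is now in "it"
reset the templateField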
It won't always be perfect because LC fields don't support many browser tags.
Thanks, Jacque, for this method.
