livecode and *Readability.js* (='.'=)

bogs · Post by **bogs** » Fri Mar 16, 2018 9:47 pm

Mariasole wrote: ↑
Fri Mar 16, 2018 6:51 pm
hello nice big cat!

I did download the git repository source he had available, and decided to poke around in it with eclipse. My eclipse has gotten a lot more involved than I remembered it

But I am poking. May take a bit, will let you know. The format looks doable though, at least to my simple way of thinking.

bogs · Post by **bogs** » Sat Mar 17, 2018 4:33 pm

While poking around, I did come across some eariler posts about this kind of topic using pure Livecode here.

Among the many notable solutions was one posted by Mark Schonewille, which looks *very* good indeed and is well commented.

Code: Select all

/*
*** Formatting HTML code ***
indentHtml v.0.1.0, 13 June 2011
Adjust kTags to define tags that need indentation.
Parameters:
theHtml: any valid HTML source code;
theTabSpaces: the number of positions occupied by one tab character
theSoftWrapCol: the column (=position number) after which lines are wrapped
theReplaceTabs: replace tabs with spaces (true, false)
theDontBreakLinks: treat links as regular text, don't indent.
If you change kTags, you should be able to format any mark-up language as
long as it uses < and > to define tags.
By Mark Schonewille
Economy-x-Talk
Nijmegen, the Netherlands
http://economy-x-talk.com
For questions and comments, use the contact form on the website
*/
constant kTags = "html,body,head,title,a,b,i,font,h1,h2,h3,h4,h5,div,center,table,tr,td,th,form,select"
local myTagArray
function indentHtml theHtml,theTabSpaces,theSoftWrapCol,theReplaceTabs,theDontBreakLinks
     replace cr with empty in theHtml
     put 0 into myOldOffset
     put 0 into myPreviousOldOffset
     put 0 into myLevel
     put empty into myTabs
     repeat
          put offset("<",theHtml,myOldOffset) into myNewOffset -- open tag
          if myNewOffset is 0 then exit repeat else add myOldOffset to myNewOffset
          put offset(">",theHtml,myNewOffset) into myOldOffset -- close tag
          if myOldOffset is 0 then exit repeat else add myNewOffset to myOldOffset
          put word 1 of (char myNewOffset+1 to myOldOffset-1 of theHtml) into myTag
          put ((myTag is "a" or myTag is "/a") and theDontBreakLinks is true) into myCurrentTagIsLink
          ///if myTag is "/a" then breakpoint
          put char myPreviousOldOffset+1 to myNewOffset-1 of theHtml into myData
          if myData is not empty then
               put theSoftWrapCol - theTabSpaces*(number of chars of myTabs+1) into myTempWrapCol
               put 1 into myLineCounter
               if myCurrentTagIsLink or (myPreviousTag is "/a" and theDontBreakLinks is true) then
                    // put offset("<",myData) into myTagPosition
                    repeat
                         if number of chars of line myLineCounter of myData > myTempWrapCol and "<" is not in \
                                char 1 to myTempWrapCol of line myLineCounter of myData then
                              put cr & myTabs before word (number of words of (char 1 to myTempWrapCol of line \
                                     myLineCounter of myData) + 1) of line myLineCounter of myData
                              add 1 to myLineCounter
                         else
                              exit repeat
                         end if
                         
                    end repeat
                    if myTag is not "/a" and myPreviousTag is not "/a" then
                         put myTabs before line 1 of myData
                    end if
               else if theTabSpaces is a number and theTabSpaces > 0 and theSoftWrapCol is a number and \
                      theSoftWrapCol > 0 then
                    repeat
                         if number of chars of line myLineCounter of myData > myTempWrapCol then
                              put cr before word (number of words of (char 1 to myTempWrapCol of line \
                                     myLineCounter of myData) + 1) of line myLineCounter of myData
                              add 1 to myLineCounter
                         else
                              exit repeat
                         end if
                         put myTabs before line myLineCounter-1 of myData
                    end repeat
                    if last char of myData is cr then delete last char of myData
                    put myTabs before the last line of myData
               end if
               if myCurrentTagIsLink then
                    put myData after myNewData
               else
                    put myData & cr after myNewData
               end if
          end if
         
          if (char 1 of myTag is slash and char 2 to -1 of myTag is among the items of kTags) then
               subtract 1 from myLevel
               put empty into myTabs
               repeat myLevel
                    put tab after myTabs
               end repeat
          end if
          if myLevel < 0 then put 0 into myLevel
         
          if myCurrentTagIsLink then
               put char myNewOffset to myOldOffset of theHtml after myNewData
          else
               put myTabs & char myNewOffset to myOldOffset of theHtml & cr after myNewData
          end if
          put myOldOffset into myPreviousOldOffset
          // if opening tag then after
          if myTag is among the items of kTags then
               add 1 to myLevel
               put empty into myTabs
               repeat myLevel
                    put tab after myTabs
               end repeat
          end if
          put myTag into myPreviousTag
     end repeat
     put myTabs & char myOldOffset+1 to -1 of theHtml after myNewData
     if theTabSpaces is a number and theTabSpaces > 0 and theReplaceTabs is true then
          put empty into mySpaces
          repeat theTabSpaces
               put space after mySpaces
          end repeat
          replace tab with mySpaces in myNewData
     end if
     return myNewData
end indentHtml

Pmbrig also had a nice solution -

On Jun 12, 2011, pmbrig wrote: One technique is to replace a repeating string that sets off pieces of content you want to isolate with a single character, e.g.,

replace "<td valign="top" align=center>" with divChar in tText

then use that as an itemdelim or linedelim to access "item n of tText" or "line n of tText"

You need to make sure that the characters you will use as delimiters are not found in the text. For HTML, all you have to do is use any higher-ASCII characters. Here is a way of doing that that actually works also with text that may contain high-ASCII characters, but you can use this fine to parse HTML text.

Code: Select all

local lineChar, itemChar, cellDivChar
local textBlockDivider, frameDivider -- extras as needed

on assignDelims tText
   put getdelimiters(tText) into delimArray
   -- an array of high-ASCII characters not found in tText
   put "lineChar,itemChar, cellDivChar" into delimList
   -- add more if you want, these are
   put the number of items of delimList into nbrDelimsNeeded
   repeat with i = 1 to nbrDelimsNeeded
      put delimArray[i] into tDelim
      if tDelim = empty then
         answer "Could only assign" && i-1 && "out of" && \
                 nbrDelimsNeeded & "!"
         -- in case you have high nbrDelimsNeeded
         -- or tText contains lots of unusual characters,
         -- and the list of allowable delims is very short
         exit to top
      end if
      do "put tDelim into" && item i of delimList
   end repeat
   
   -- you can also do this manually, for one-off parsing jobs
   -- should check that delimArray[n] is not empty if in doubt
   -- (declare script local variables as needed)

   -- put delimArray[4] into textBlockDivider
   -- put delimArray[5] into frameDivider
   -- etc
end assignDelims

on getdelimiters tText
   if tText = empty then return empty
   put "ßπ∆ƒµ¡™£¢∞§¶ªç≈…æ∑ø©®Ω" into charList
   -- don't know if this will show well in all email clients
   -- it's a string of high-ASCII characters
   put 0 into tCount
   repeat for each char tChar in charList
      if tChar is in tText then next repeat
      add 1 to tCount
      put tChar into delimList[tCount]
   end repeat
   return delimList
end getdelimiters

   -- Then you can do things like:
   replace cr with empty in tText
   -- html ignores cr's, and extraneous returns may complicate things
   replace "<p>" with lineChar in tText
   replace "</font></td><td valign="top"><font face="Arial" size="-1">" with cellDivChar in tText
   -- or whatever the tag string is for this particular table
   -- then:

   set the lineDelimiter to lineChar
   set the itemDelimiter to cellDivChar
   repeat for each line tLine in tText
      repeat for each item textRun in tLine
      -- do more parsing stuff here:
      -- now you can work on each block textRun
       end repeat
   end repeat
/*
This helps to pare down some of the HTML formatting/tagging so as to use LC's powerful chunk manipulation to extract the content you want.

-- Peter
*/

Something to think about

FourthWorld · Post by **FourthWorld** » Sat Mar 17, 2018 7:41 pm

If I read the intro docs at the Google Code page for Boiler Pipe, it appears to be for extracting text sans formatting for passing off to full-text indexers, but does not attempt semantic analysis of portions of the extracted text.

Please point me to what I missed if I misread that. But if I didn't, it seems we have two separate challenges here:
1. Stripping HTML/CSS/JS from the contents of a web page.
2. Identifying specific content elements by semantic role ( title, summary and article were specified as desired elements a few posts back).

#1 is not hard, but should be done after #2, so we can take advantage of HTML tags to identify content element boundaries. Once a content element is extracted, setting the htmlText of the templateField to that HTML and then retrieving the text of the templateField will strip tags and return plain text (which MaxV had suggested here early on).

#2 is much harder, because there are varying conventions for semantic markup, and they are not not consistently used, if they're used at all.

A page's title is easy enough, and the offset function (or if you prefer regex the matchText function) can be used to identify the opening and closing tags to obtain that string.

"Summary" may be more difficult, unless the metatag often used for "description" will suffice, in which case it's no more difficult than extracting the title, done by pretty much the same means.

If you attempt to write a single routine which can extract an "article" from any web page, you will go insane and as you eek out your remaining existence in an asylum you will never find a satisfying solution to your dying day.

I wish I had better news, but there are no commonly-used conventions for identifying the boundaries between "article content" and navigational elements, pull quotes, header, footer, etc. HTML5 offers some nice assistance with its semantic tags, but even those are still flexible enough that it can be quite difficult to find a finite set of patterns for isolating "article" content. And those tags exist in only a minority of pages, so for most of the web you are, in a word, hosed.

If you have a specific collection of pages in mind for your mining, you may be able to identify common patterns among them. If so, the guidance about offset and matchText above could be applied for this task as well.

But if you find dizzying variance of tag patterns among web pages, the only consolation I can provide is that you're not alone: the web is full of articles about scraping strategies. I found this book helpful, but before you purchase anything try a few web searches for free info, there's plenty out there.
https://nostarch.com/webbots2

Unless you have the good fortune of working with a limited variety of patterns among the pages you need to parse, you may find an imperfect solution in first removing the things you know you don't need and hand-editing the remainder to clean up anything you may have missed.

Even better would be if the sources you're using also publish complete content in RSS format. RSS is horribly abused and so fragmented today it barely qualifies as a single thing (don't even get me started about the differences between "Really Simple Syndication" and "RDF Site Summary" - argh!!!!). But for all the abuse and misunderstood conventions, it's less hair-pulling than the vast wild jungle of varying HTML usage patterns.

bogs · Post by **bogs** » Sat Mar 17, 2018 8:14 pm

Sounds to me like you have the gist of it

Btw, I loved this line -

FourthWorld wrote: ↑
Sat Mar 17, 2018 7:41 pm
If you attempt to write a single routine which can extract an "article" from any web page, you will go insane and as you eek out your remaining existence in an asylum you will never find a satisfying solution to your dying day.

Already being nuts myself, I don't mind the mental torture, but I hope Mariasole understands if I don't spend quite *that* much time on it

bogs · Post by **bogs** » Sun Mar 18, 2018 2:38 pm

I'm going to have to admit defeat on using boiler pipe for now, Mariasole. Both my unfamiliarity with java and the way he wrote the source leaves me baffled, other than the earlier suggestions of using the html page api which was rather easy.

I did download the .js source that firefox actually uses for the reader part of the browser, and scanned that as well. It appears much as Richard pointed out, it doesn't filter every kind of page, which you can see for yourself. Pages have to be set up in a particular manner to be put into reader view, so it looks like you could set up the same rules in Lc or even redirect your inquiry through the readerview .js file, but then you are no farther ahead than if you were redirecting through that boilerpipe webpage again.

Now, the above isn't to say give up all hope, I am sure someone on these boards is far more fluent in Java than I am, it is possible someone else may have the answer for you at some point.

I'll take a further look at firefox's readerview file to see if anything comes of that.

Mariasole · Post by **Mariasole** » Mon Mar 19, 2018 9:15 am

Thanks Bogs! Thanks Richard!
I have carefully read everything, both the logical steps of Richard and the heroic work of Bogs (there is no eclipse in your mind, the sun always shines

). I really thank you for your help.

In reality, however, I think we are going forward

, given that, set aside for now boilerpipe because Java remains a mystery for us artists of Hypercard

, but remains a hope in Javascript, I think much more affordable!

I think the Firefox resource that found Bogs is this:

Readability.js [A standalone version of the readability library used for Firefox Reader View. ]
https://github.com/mozilla/readability

From what I understand is in javascript

, could you interface the some way?

What I would like to do is this:
a) LC or curl download the page and get the "raw" HTML code in the field "INPUT" (up to here I can do it, after years!

)
b) Readability.js deals with processing the text "extract" and returns it into field "OUTPUT"

And now the usual eternal beginner question ... how do I start and control Readability.js

Thanks for everything!
Thank you all!
Thanks to my big cat!

Mariasole
(='.'=)

bogs · Post by **bogs** » Mon Mar 19, 2018 4:07 pm

That is the very .js file I am indeed looking at.

I know that it isn't geared to open a local page, but it might be possible to either jigger it or use it in a way that wasn't originally intended or, of course, just understand the problems involved better and apply them elsewhere.

Mariasole · Post by **Mariasole** » Mon Mar 19, 2018 6:59 pm

Exactly! Perfect approach Bogs!

Tonight I print the code listings

.
I take a highlighter marker and try to understand something

.
Even if I do not know anything about javascript, maybe I can guess some steps that can be useful.

Javascript ad sensum, said the ancient Romans

Thanks my lovely Abraham de Lacy Giuseppe Casey Thomas O'Malley!

Mariasole
(='.'=)

bogs · Post by **bogs** » Mon Mar 19, 2018 7:09 pm

Mariasole wrote: ↑
Mon Mar 19, 2018 6:59 pm
Even if I do not know anything about javascript, maybe I can guess some steps that can be useful.

I think so

Mariasole · Post by **Mariasole** » Thu Mar 22, 2018 12:22 pm

I have read all the list of code of "Readability.js" and I am isolating the functions.

Although, due to my total ignorance of javascript, I do not understand how they are recalled.
I can not therefore understand the scheme. But roughly the internal functioning of the functions I intuit it.

I wonder if, however, it is possible to "interface" this library to the browser widget.
In fact, I read in the readme of Readability.js:

To parse a document, you must create a new `Readability` object from a URI object and a document object, and then call `parse()`. Here's an example:

```javascript
var loc = document.location;
var uri = {
spec: loc.href,
host: loc.host,
prePath: loc.protocol + "//" + loc.host,
scheme: loc.protocol.substr(0, loc.protocol.indexOf(":")),
pathBase: loc.protocol + "//" + loc.host + loc.pathname.substr(0, loc.pathname.lastIndexOf("/") + 1)
};
var article = new Readability(uri, document).parse();
```

This `article` object will contain the following properties:

* `uri`: original `uri` object that was passed to constructor
* `title`: article title
* `content`: HTML string of processed article content
* `length`: length of article, in characters
* `excerpt`: article description, or short excerpt from content
* `byline`: author metadata
* `dir`: content direction

If you're using Readability on the web, you will likely be able to use a `document` reference from elsewhere (e.g. fetched via XMLHttpRequest, in a same-origin `<iframe>` you have access to, etc.). Otherwise, you would need to construct such an object using a DOM parser such as [jsdom](https://github.com/tmpvar/jsdom). While this repository contains a parser of its own (`JSDOMParser`), that is restricted to reading XML-compatible markup and therefore we do not recommend it for general use.

If I could interface this library with the browser widget, due to subtraction of function in the javascript code library, I would be able to understand how it works and make a porting on Livecode (obviously in a couple of years!

)

Is it theoretically possible to make this library dialogue with the widget?

Thank you all in advance!

Mariasole
(='.'=)

bogs · Post by **bogs** » Thu Mar 22, 2018 1:12 pm

Heya Mariasole, I didn't forget about this project, currently parsing this information (you may want to look through it too).

Mariasole · Post by **Mariasole** » Thu Mar 22, 2018 3:28 pm

Wow! Thank you! It's nice that we study together!

In short, you are the teacher and I am the student!
Thanks Professor Bogs, now I start studying.
Thank you so much!

Mariasole
(='.'=)

bogs · Post by **bogs** » Thu Mar 22, 2018 3:41 pm

LOL

More likely I am a highly distracted unable to focus nutcase who will lead you down rabbit hole after rabbit hole until I get lucky (or not)

I'll leave professering to Richmond

Mariasole · Post by **Mariasole** » Mon Mar 26, 2018 3:55 pm

I found some interesting sources that can help us to understand better.

Having no programming experience other than in LiveCode(

), I consider porting in other languages as a sort of Rosetta Stone.

Here's what I found:

Readability.php
https://github.com/andreskrey/readability.php

python-readability
https://github.com/buriy/python-readability

Ruby Readability
https://github.com/cantino/ruby-readability

And last but not least...
Readability 1.0
(Source code to the original Arc90 Experiment, which was transformed into the Redability.com product, incorporated into Safari's Reader view, Flipboard, and Treesaver.)
https://github.com/masukomi/ar90-readability

Peace and love and may God bless you all
Mariasole
(='.'=)

bogs · Post by **bogs** » Fri Apr 06, 2018 6:04 am

Mariasole wrote: ↑
Tue Mar 13, 2018 10:09 am
I read the article: https://livecode.com/infinite-livecode-java-progress/
I do not understand anything about Java ... but I think that what I ask can not be done!
Mariasole
(='.'=)

Hmmm, maybe it *can* be done now. While I was browsing through Max's wiki, I came across this in reference to the Foreign Function Interface built into Lc9. I didn't understand it all mind you.

*Edit - still working on the .js angle myself, I didn't forget

LiveCode Forums

livecode and Readability.js (='.'=)

Re: livecode and JAVA boilerpipe (='.'=)

Re: livecode and JAVA boilerpipe (='.'=)

Re: livecode and JAVA boilerpipe (='.'=)

Re: livecode and JAVA boilerpipe (='.'=)

Re: livecode and JAVA boilerpipe (='.'=)

Re: livecode and JAVA boilerpipe (='.'=)

Re: livecode and Firefox Readability.js (ex java boilerpipe) (='.'=)

Re: livecode and Firefox Readability.js (ex java boilerpipe) (='.'=)

Re: livecode and Firefox Readability.js (ex java boilerpipe) (='.'=)

Re: livecode and Firefox Readability.js (ex java boilerpipe) (='.'=)

Re: livecode and Readability.js (='.'=)

Re: livecode and Readability.js (='.'=)

Re: livecode and Readability.js (='.'=)

Re: livecode and Readability.js (='.'=)

Re: livecode and JAVA boilerpipe (='.'=)