Get all code in a HTML div

andrewferguson · Post by **andrewferguson** » Wed Oct 15, 2014 6:00 pm

Hi,
I have a large number of HTML files that I need to process. One thing I need to do is extract the code from a div. Every HTML file contains the div, which can be identified because it starts with "<div id='wikitext' >". Being HTML, the div ends with "</div>". However, some of the divs are likely to contain other divs, which makes extracting the code trickier. Is there a way to extract all the code from one HTML element, including the starting "<div id='wikitext'>" and the ending "</div>"?
Thanks,
Andrew

[-hh] · Post by **[-hh]** » Wed Oct 15, 2014 9:41 pm

You could try this as a start. It searches for the first such "div group" in a string.
[Maybe Thierry has a single regex line for that?]

Code: Select all

on mouseUp
  put fld "IN" into s
  put replaceText(s,"<\s*div\s*", "<div ") into s
  put replaceText(s,"<\s*/div\s*>", "</div>") into s
  put replaceText(s,"<\s*div\s*id='wikitext'\s*>", "<div id='wikitext'>") into s
  put "<div id='wikitext'" into startX
  put "<div" into leftX; put "</div>" into rightX
  put scanForNextGroup(startX,leftX,rightX,s) into fld "OUT"
end mouseUp

-- Scans for content of first group at level 1 (= not nested)
--   defined by startString t0, endDelimiter t2
--   ignoring other groups defined by delimiter strings t1,t2
function scanForNextGroup t0,t1,t2,s
  put length(t1) into l1
  put length(t2) into l2
  put length(t0) into l0
  put offset (t0,s) into o0
  if o0=0 then return "No such group"
  put o0+l0-1 into o3; put o0 into o30
  put 1 into j ## <-- now your group is open
  repeat
    if the shiftkey is down then exit repeat # <-- use as EXIT when testing
    put offset (t1,s,o3) into o1; put offset (t2,s,o3) into o2
    if o2=0 then exit repeat -- no closing tag left, so the group cannot be closed
    if o2 < o1 or o1=0 then
      subtract one from j; add o2+l2-1 to o3
    else
      add one to j; add o1+l1-1 to o3
    end if
    if j=0 then return char o30 to o3 of s ## <-- now your group is closed
  end repeat
  return "group not closed with " &j & " 'open divs' "
end scanForNextGroup

Remarks.
= If you use Unicode, that has to be (slightly) adjusted.
= This returns the first such group at level 1. If another such group is opened inside it, that inner group is returned as part of the content, and you have to scan the returned text again, and so on. It works like an expression with parentheses "( )" at several levels: you first get the text between the outermost parentheses and then go stepwise to the inner levels.
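
The remark about scanning the returned text again could look roughly like the sketch below. The handler name innerWikitextGroup is made up for illustration, and the sketch assumes that scanForNextGroup from the script above is in the same script (or otherwise in the message path).

```livecode
-- Hypothetical helper, not part of the script above: returns the first
-- wikitext group nested inside an already extracted group pOuter.
function innerWikitextGroup pOuter
  -- drop the first char so the outer start string no longer matches;
  -- pOuter is passed by value, so the caller's text is not changed
  delete char 1 of pOuter
  return scanForNextGroup("<div id='wikitext'", "<div", "</div>", pOuter)
end innerWikitextGroup
```

If there is no nested group, the function simply passes on the "No such group" result of scanForNextGroup.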

[Edit. Added "remarks" above. Updated the 'preparing' mouseUp handler by replaceText-lines. Removed a note.]

dunbarx · Post by **dunbarx** » Wed Oct 15, 2014 9:56 pm

Hermann.

It sure is. I think you should post this directly to the quality control center. As Richard points out, that location also accepts enhancement requests, and I bet that venue is more visible to the Rev team than the one in the forum. Mark Waddingham replied to my original super-delimiter request personally, and I think your enhanced enhancement would be just the sort of thing he would appreciate.

Craig

[-hh] · Post by **[-hh]** » Wed Oct 15, 2014 10:47 pm

Life is a thing of give and take, not only for us.

andrewferguson · Post by **andrewferguson** » Thu Oct 16, 2014 8:55 am

Hi Hermann,
Thanks very much, your solution works perfectly!
Andrew

[-hh] · Post by **[-hh]** » Thu Oct 16, 2014 4:07 pm

Hi Andrew,

I still found something to fine-tune:
There is no problem if you have machine-generated code. But sometimes HTML is written by hand, as you and I presumably do.

An HTML parser also accepts "</div >" or "</div" & cr & ">" etc. To avoid whitespace problems, start with a regex that removes all unwanted whitespace (incl. cr) from your starting/opening/closing patterns (= delimiting strings):
put replaceText(s,"<\s*div\s*", "<div ") into s
put replaceText(s,"<\s*/div\s*>", "</div>") into s
put replaceText(s,"<\s*div\s*id='wikitext'\s*>", "<div id='wikitext'>") into s

I updated the code in my first post above accordingly.
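
A quick way to see what the three replaceText lines do is to run them on a small hand-written sample. This is only a sketch: the handler name demoNormalize and the sample string are made up for illustration.

```livecode
on demoNormalize
  -- hand-written style sample with stray spaces and a line break
  put "<  div id='wikitext' >text</div" & cr & ">" into s
  put replaceText(s,"<\s*div\s*", "<div ") into s
  put replaceText(s,"<\s*/div\s*>", "</div>") into s
  put replaceText(s,"<\s*div\s*id='wikitext'\s*>", "<div id='wikitext'>") into s
  put s -- message box should now show: <div id='wikitext'>text</div>
end demoNormalize
```

Note that the patterns only cover the single-quoted form id='wikitext' with no other attributes in the tag. A plain "<div>" becomes "<div >", which does not matter for the scan because it only looks for "<div".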

Hermann

andrewferguson · Post by **andrewferguson** » Fri Oct 17, 2014 12:42 pm

Thanks Hermann! You make using LiveCode so much easier.
I think that the HTML code I have been processing is machine generated, but just to be on the safe side I have added in your extra code.
Andrew

[-hh] · Post by **[-hh]** » Mon Oct 20, 2014 6:12 pm

Yesterday I used this myself for a similar task and extended the script to collect ALL those 'outermost' divs (that is, those 'at level 1').

The occurrences are numbered, and their char ranges in the original input data (after replacing some whitespace!!) are collected as well.

You can use them at once (see demo script below) or get the content and the charRange of a single one, say of "your-div" number two (gContent[2] and gRange[2]).

The script below is complete, applied to your original question.
Currently the arrays that hold the 'divs' and their 'ranges' are local variables only but could be saved to custom properties (or files).

Code: Select all

-- The following worked for me.
-- Everybody who intends to use it: Please test it first (with your testdata).

local gContent -- gContent[i] holds group i
local gRange   -- gRange[i] holds (start,stop) of group i within S

on mouseUp
  put fld "IN" into s0
  put replaceText(s0,"<\s*div\s*", "<div ") into s0
  put replaceText(s0,"<\s*/div\s*>", "</div>") into s0
  put replaceText(s0,"<\s*div\s*id='wikitext'\s*>", "<div id='wikitext'>") into s0
  put s0 into fld "IN" # <-- need this to have correct values in gRange !!
  put "<div id='wikitext'" into startX
  put "<div" into leftX; put "</div>" into rightX
  set cursor to watch
  put empty into gContent; put empty into gRange
  scanForAllGroups startX,leftX,rightX,s0
  put the keys of gContent into myK; sort myK numeric
  put empty into s0
  repeat for each line k in myK
      put cr & "#M_" & k & ": " & gRange[k] & cr & gContent[k] after s0
  end repeat
  delete char 1 of s0
  put s0 into fld "OUT"
end mouseUp

-- Scans for content of all groups at level 1 (= not nested)
--   in S, groups defined by startString t0, endDelimiter t2
--   ignoring other groups defined by delimiter strings t1,t2
private command scanForAllGroups t0,t1,t2,S
  -- initialize variables
  put 0 into i; put 0 into o30 -- i counts the groups, o30 is a char marker
  put len(t1) into l1; put len(t2) into l2; put len(t0) into l0
  put false into exxit -- becomes true only if the scan is aborted with the shift key
  repeat
    put offset (t0,S,o30) into o0
    if o0=0 then exit repeat -- no (further) group found
    if the shiftkey is down then # <-- use as exit when testing
      put true into exxit; exit repeat
    end if
    # -- here starts the scan for group i
    add 1 to i; put 1 into j -- j counts the open divs; the group itself is level 1
    put o30+o0+l0-1 into o3; add o0 to o30
    repeat
      if the shiftkey is down then # <-- use as exit when testing
        put true into exxit; exit repeat
      end if
      put offset (t1,S,o3) into o1; put offset (t2,S,o3) into o2
      if o2=0 then exit repeat -- no closing tag left: group i stays unclosed
      if o2 < o1 or o1=0 then
        subtract one from j; add o2+l2-1 to o3
      else
        add one to j; add o1+l1-1 to o3
      end if
      if j=0 then exit repeat -- group i is closed
    end repeat
    if exxit then exit repeat
    if j=0 then
      put char o30 to o3 of S into gContent[i] -- div content
    else
      put "group not closed with " & j & " 'open divs'." into gContent[i]
    end if
    put (o30,o3) into gRange[i] -- the div start and end in S
    if j > 0 then exit repeat -- stop at the first unclosed group
    put o3 into o30
    # -- here ends the scan for group i
  end repeat
  if exxit then
    put "exited by shiftkey" into gContent[1]
    put (0,0) into gRange[1]
  end if
end scanForAllGroups
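
As a rough sketch of how the collected arrays might be used afterwards: the handlers below select one group in fld "IN" by its stored char range and save both arrays to custom properties. All handler names and the property names cDivContent and cDivRange are made up for illustration. The handlers have to be in the same script as the script-local variables gContent and gRange.

```livecode
-- Hypothetical helpers, not part of the script above.

-- Select group pNum in fld "IN", using its stored char range.
command selectGroup pNum
  if gRange[pNum] is empty then exit selectGroup
  select char (item 1 of gRange[pNum]) to (item 2 of gRange[pNum]) of fld "IN"
end selectGroup

-- Store both arrays in custom properties of this card.
command saveGroups
  if gContent is not an array then exit saveGroups -- nothing scanned yet
  set the cDivContent of this card to arrayEncode(gContent)
  set the cDivRange of this card to arrayEncode(gRange)
end saveGroups

-- Read them back, for example after reopening the stack.
command loadGroups
  if the cDivContent of this card is empty then exit loadGroups
  put arrayDecode(the cDivContent of this card) into gContent
  put arrayDecode(the cDivRange of this card) into gRange
end loadGroups
```

The ranges only fit fld "IN" as long as the field still holds the whitespace-normalized text that mouseUp wrote back into it.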
