Get all code in a HTML div
-
- VIP Livecode Opensource Backer
- Posts: 184
- Joined: Wed Apr 10, 2013 5:09 pm
Get all code in a HTML div
Hi,
I have a large number of HTML files that I need to process. One thing I need to do is extract the code from a div. Every HTML file contains the div, which can be identified because it starts with "<div id='wikitext' >". Being HTML, the div ends with "</div>". However, some of the divs are likely to contain other divs, which makes extracting the code trickier. Is there a way to extract all the code of one HTML element, including the starting "<div id='wikitext'>" and the ending "</div>"?
Thanks,
Andrew
-
- VIP Livecode Opensource Backer
- Posts: 2262
- Joined: Thu Feb 28, 2013 11:52 pm
- Location: Göttingen, DE
Re: Get all code in a HTML div
You could try this as a start; it searches for the first such "div group" in a string.
[Maybe Thierry has a one-line regex for that?]
Code:
on mouseUp
   put fld "IN" into s
   -- normalize whitespace inside the div tags so that plain offsets can find them
   put replaceText(s,"<\s*div\s*", "<div ") into s
   put replaceText(s,"<\s*/div\s*>", "</div>") into s
   put replaceText(s,"<\s*div\s*id='wikitext'\s*>", "<div id='wikitext'>") into s
   put "<div id='wikitext'" into startX
   put "<div" into leftX; put "</div>" into rightX
   put scanForNextGroup(startX,leftX,rightX,s) into fld "OUT"
end mouseUp

-- Scans for content of first group at level 1 (= not nested)
-- defined by startString t0, endDelimiter t2
-- ignoring other groups defined by delimiter strings t1,t2
function scanForNextGroup t0,t1,t2,s
   put length(t1) into l1
   put length(t2) into l2
   put length(t0) into l0
   put offset (t0,s) into o0
   if o0=0 then return "No such group"
   -- o30 = first char of the group, o3 = last char scanned so far
   put o0+l0-1 into o3; put o0 into o30
   put 1 into j ## <-- now your group is open
   repeat
      if the shiftkey is down then exit repeat # <-- use as EXIT when testing
      -- with charsToSkip, offset() counts from o3, not from char 1
      put offset (t1,s,o3) into o1; put offset (t2,s,o3) into o2
      -- no closing tag left: the group can never be closed
      if o2=0 then exit repeat
      if o2 < o1 or o1=0 then
         subtract one from j; add o2+l2-1 to o3
      else
         add one to j; add o1+l1-1 to o3
      end if
      if j=0 then return char o30 to o3 of s ## <-- now your group is closed
   end repeat
   return "group not closed with " &j & " 'open divs' "
end scanForNextGroup
Remarks.
= If your text is Unicode, the script has to be adjusted slightly.
= This returns the first such group at level 1. If another div is opened inside the group, it is returned as part of the content, and you have to scan the returned text again to reach it, and so on. It works like an expression with parentheses "( )" nested at several levels: you first get the text between the outermost parentheses and then go stepwise to the inner levels.
[Edit. Added the remarks above. Updated the preparatory mouseUp handler with the replaceText lines. Removed a note.]
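To go one level deeper, as the second remark describes, something along these lines might work. It is only a sketch: firstInnerDiv is a made-up name, it relies on scanForNextGroup from the script above, and it assumes that the opening tag of the group contains no ">" inside an attribute value.

```livecode
-- Hypothetical helper (not part of the original script).
-- pGroup is a result of scanForNextGroup, i.e. an opening tag ... "</div>".
-- Returns the first div nested inside pGroup, or "No such group".
function firstInnerDiv pGroup
   -- skip the outer opening tag so that the scan starts inside the group
   put offset(">", pGroup) into tEnd
   if tEnd = 0 then return "No such group"
   put char tEnd + 1 to -1 of pGroup into tInner
   -- any "<div" may start the inner group, so start and left delimiter are equal
   return scanForNextGroup("<div", "<div", "</div>", tInner)
end firstInnerDiv
```

Called with "<div id='wikitext'>a<div class='x'>b</div>c</div>", this should return "<div class='x'>b</div>". Calling it again on its own result goes down one more level.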
Last edited by [-hh] on Sat Oct 18, 2014 11:16 am, edited 3 times in total.
shiftLock happens
-
- VIP Livecode Opensource Backer
- Posts: 9665
- Joined: Wed May 06, 2009 2:28 pm
- Location: New York, NY
Re: Get all code in a HTML div
Hermann.
It sure is. I think you should post this directly to the quality control center. As Richard points out, that location also accepts enhancement requests, and I bet that venue is more visible to the Rev team than the one in the forum. Mark Waddingham replied to my original super-delimiter request personally, and I think your enhanced enhancement would be just the sort of thing he would appreciate.
Craig
-
- VIP Livecode Opensource Backer
- Posts: 2262
- Joined: Thu Feb 28, 2013 11:52 pm
- Location: Göttingen, DE
Re: Get all code in a HTML div
Life is a thing of give and take, not only for us.
Last edited by [-hh] on Sat Oct 18, 2014 11:18 am, edited 1 time in total.
shiftLock happens
-
- VIP Livecode Opensource Backer
- Posts: 184
- Joined: Wed Apr 10, 2013 5:09 pm
Re: Get all code in a HTML div
Hi Hermann,
Thanks very much, your solution works perfectly!
Andrew
-
- VIP Livecode Opensource Backer
- Posts: 2262
- Joined: Thu Feb 28, 2013 11:52 pm
- Location: Göttingen, DE
Re: Get all code in a HTML div
Hi Andrew,
I still found some things to fine-tune.
Machine-generated code is no problem. But some HTML is written by hand, as you and I presumably do.
An HTML parser also accepts "</div >" or "</div" & cr & ">" etc., so to avoid whitespace problems, start with regexes that remove all unwanted whitespace (incl. cr) from the strings that delimit your start, opening, and closing patterns:
put replaceText(s,"<\s*div\s*", "<div ") into s
put replaceText(s,"<\s*/div\s*>", "</div>") into s
put replaceText(s,"<\s*div\s*id='wikitext'\s*>", "<div id='wikitext'>") into s
I updated the code in my first post above accordingly.
Hermann
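A quick way to see what the three replaceText lines do is to run them on a small hand-written sample. The sample string below is made up for illustration; the three lines themselves are the ones from the post.

```livecode
on mouseUp
   -- made-up sample: stray spaces in the opening tag, a line break in the closing tag
   put "< div  id='wikitext' >text</div" & cr & ">" into s
   put replaceText(s,"<\s*div\s*", "<div ") into s
   put replaceText(s,"<\s*/div\s*>", "</div>") into s
   put replaceText(s,"<\s*div\s*id='wikitext'\s*>", "<div id='wikitext'>") into s
   put s -- the message box should show: <div id='wikitext'>text</div>
end mouseUp
```

Note that the first pattern also touches any other tag whose name merely starts with "div", so it is worth checking the result on real data before relying on it.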
shiftLock happens
-
- VIP Livecode Opensource Backer
- Posts: 184
- Joined: Wed Apr 10, 2013 5:09 pm
Re: Get all code in a HTML div
Thanks Hermann! You make using LiveCode so much easier.
I think that the HTML code I have been processing is machine generated, but just to be on the safe side I have added in your extra code.
Andrew
-
- VIP Livecode Opensource Backer
- Posts: 2262
- Joined: Thu Feb 28, 2013 11:52 pm
- Location: Göttingen, DE
Re: Get all code in a HTML div
Yesterday I used this myself for a similar task and extended the script to collect ALL those 'outermost' divs (that is, those at level 1).
The occurrences are numbered, and their char ranges in the original input data (after replacing some whitespace!!) are collected as well.
You can use them all at once (see the demo script below) or get the content and the char range of a single one, say of "your div" number two (gContent[2] and gRange[2]).
The script below is complete and applied to your original question.
Currently the arrays that hold the divs and their ranges are script-local variables only, but they could be saved to custom properties (or files).
Code:
-- The following worked for me.
-- Everybody who intends to use it: Please test it first (with your testdata).
local gContent -- gContent[i] holds group i
local gRange -- gRange[i] holds (start,stop) of group i within S

on mouseUp
   put fld "IN" into s0
   put replaceText(s0,"<\s*div\s*", "<div ") into s0
   put replaceText(s0,"<\s*/div\s*>", "</div>") into s0
   put replaceText(s0,"<\s*div\s*id='wikitext'\s*>", "<div id='wikitext'>") into s0
   put s0 into fld "IN" # <-- need this to have correct values in gRange !!
   put "<div id='wikitext'" into startX
   put "<div" into leftX; put "</div>" into rightX
   set cursor to watch
   put empty into gContent; put empty into gRange
   scanForAllGroups startX,leftX,rightX,s0
   put the keys of gContent into myK; sort myK numeric
   put empty into s0
   repeat for each line k in myK
      put cr & "#M_" & k & ": " & gRange[k] & cr & gContent[k] after s0
   end repeat
   delete char 1 of s0
   put s0 into fld "OUT"
end mouseUp

-- Scans for content of all groups at level 1 (= not nested)
-- in S, groups defined by startString t0, endDelimiter t2
-- ignoring other groups defined by delimiter strings t1,t2
private command scanForAllGroups t0,t1,t2,S
   put 0 into i; put 0 into o30 -- i counts the groups, o30 is a char marker
   put len(t1) into l1; put len(t2) into l2; put len(t0) into l0
   put false into wasAborted -- true only if the shift key stopped the scan
   repeat
      if the shiftkey is down then # <-- use as exit when testing
         put true into wasAborted; exit repeat
      end if
      put offset (t0,S,o30) into o0
      if o0=0 then exit repeat -- no further group found
      # -- here starts the scan for group i
      add 1 to i; put 1 into j -- j counts the divs currently open
      put o30+o0+l0-1 into o3; add o0 to o30
      repeat
         if the shiftkey is down then # <-- use as exit when testing
            put true into wasAborted; exit repeat
         end if
         put offset (t1,S,o3) into o1; put offset (t2,S,o3) into o2
         if o2=0 then exit repeat -- no closing tag left
         if o2 < o1 or o1=0 then
            subtract one from j; add o2+l2-1 to o3
         else
            add one to j; add o1+l1-1 to o3
         end if
         if j=0 then exit repeat -- group i is closed
      end repeat
      if wasAborted then exit repeat
      if j=0 then
         put char o30 to o3 of S into gContent[i] -- div content
         put (o30,o3) into gRange[i] -- the div start and end in S
         put o3 into o30 -- continue the search after this group
      else
         put "group not closed with " &j & " 'open divs'." into gContent[i]
         put (o30,o3) into gRange[i]
         exit repeat -- an unclosed group leaves nothing more to scan
      end if
      # -- here ends the scan for group i
   end repeat
   if wasAborted then
      put "exited by shiftkey" into gContent[1]
      put (0,0) into gRange[1]
   end if
end scanForAllGroups
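As a possible follow-up to the script above, here is a sketch of how a single range could be used and how the two arrays might be kept in custom properties. The handler names selectGroup and saveGroups and the property names uDivContent and uDivRange are invented for this example. Both handlers have to sit in the same script as the local declarations of gContent and gRange, below them.

```livecode
-- Hypothetical handlers for illustration only.

-- Selects group number n in fld "IN", using the stored char range.
command selectGroup n
   if gRange[n] is empty then exit selectGroup -- no such group
   put item 1 of gRange[n] into tStart
   put item 2 of gRange[n] into tStop
   if tStart = 0 then exit selectGroup -- (0,0) marks an aborted scan
   select char tStart to tStop of fld "IN"
end selectGroup

-- Stores both arrays in custom properties of the object owning this script.
command saveGroups
   if gContent is not an array then exit saveGroups -- nothing collected yet
   set the uDivContent of me to arrayEncode(gContent)
   set the uDivRange of me to arrayEncode(gRange)
end saveGroups
```

After a scan, "selectGroup 2" would highlight the second div in the field, and arrayDecode(the uDivContent of me) gives the saved array back later. The ranges only fit the field while its text is the normalized text that the mouseUp handler wrote into it.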
shiftLock happens