Regex / variable line extraction

SparkOut
Posts: 2949
Joined: Sun Sep 23, 2007 4:58 pm

Post by SparkOut » Fri Apr 24, 2015 9:44 pm

I am struggling at the limits of my regex knowledge. Using online regex testers I can get successful matches for single lines, but I am having problems with greediness and with setting single-line mode. I need to extract some text from the HTML of a WordPress site over which I have no control.

An extract of the text I have to parse is:

Code: Select all

	<div class="wpb_text_column wpb_content_element ">
		<div class="wpb_wrapper">
			<h2>Overview</h2>

		</div> 
	</div> 
<div class="vc_row wpb_row vc_inner">
    <div class="vc_row-fluid clr">
                    
	<div class="vc_col-sm-9 wpb_column clr column_container  " style="margin-bottom:0px;"><div class="clr " style="padding-bottom:0px;">
			
	<div class="wpb_text_column wpb_content_element ">
		<div class="wpb_wrapper">
			<p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa.</p>
<p>Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Donec quam felis, ultricies nec, pellentesque eu, pretium quis, sem. Nulla consequat massa quis enim. Donec pede justo, fringilla vel, aliquet nec, vulputate eget, arcu.</p>
<p>In enim justo, rhoncus ut, imperdiet a, venenatis vitae, justo. Nullam dictum felis eu pede mollis pretium. Integer tincidunt. Cras dapibus.</p>
<p>Vivamus elementum semper nisi. Aenean vulputate eleifend tellus. Aenean leo ligula, porttitor eu, consequat vitae, eleifend ac, enim. Aliquam lorem ante, dapibus in, viverra quis, feugiat a, tellus. Phasellus viverra nulla ut metus varius laoreet.</p>
<p>Quisque rutrum. Aenean imperdiet. Etiam ultricies nisi vel augue. Curabitur ullamcorper ultricies nisi. Nam eget dui. Etiam rhoncus. Maecenas tempus, tellus eget condimentum rhoncus, sem quam semper libero, sit amet adipiscing sem neque sed ipsum.</p>

		</div> 
	</div> </div>
	</div> 
I need to extract the (variable number of) lines of text in the <p> tags. I have spent a lot of time on regex tutorials and I am still bemused. I am led to believe that

Code: Select all

<h2>Overview<\/h2>.+?<\/div>.+?<\/div>.+?<div class=.+?<div class=.+?<div class=.+?<div class=.+?<div class=.+?<p>(.*)<\/p>/s
should approach success: it identifies the start of the required region by the Overview heading, and because .* is greedy I use .+? instead to match the plain text, whitespace or newlines, given the /s setting. Using .* in place of .+? doesn't work either, nor does any other combination I have been able to make, except

Code: Select all

<h2>Overview<\/h2>\s*<\/div>\s*<\/div>\s*<div class=.*\s*<div class=.*\s*<div class=.*\s*<div class=.*\s*<div class=.*\s*<p>(.*)<\/p>
which works, but only for one line, and it fails if I add the /s setting.
Where am I going wrong?
Thanks anyone (*coughThierrycough*)

Thierry
VIP Livecode Opensource Backer
Posts: 875
Joined: Wed Nov 22, 2006 3:42 pm

Re: Regex / variable line extraction

Post by Thierry » Sat Apr 25, 2015 1:59 pm

Hi SparkOut,

Here is some medicine against coughing:

Code: Select all

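   -- (?s) lets "." match line breaks; each lazy ".+?" stops at the first place the rest can match
   -- the "." on either side of wpb_wrapper stands in for the double quotes in the HTML
   get "(?ms)<h2>Overview</h2>.+?<div\s+class=.wpb_wrapper.>(.+?)</div>"
   if matchText( htmlText, IT, OverviewParagraphs) then
      -- strip the paragraph tags, leaving one paragraph per line
      replace "<p>" with empty in OverviewParagraphs
      replace "</p>" with cr in OverviewParagraphs
   end if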

Regards,

Thierry

SparkOut
Posts: 2949
Joined: Sun Sep 23, 2007 4:58 pm

Re: Regex / variable line extraction

Post by SparkOut » Sat Apr 25, 2015 2:26 pm

Thanks for the incredibly fantastic help Thierry, on and off list.
For the record, it's been an education in how LiveCode takes regex options, unlike the online regex testers: (?ms) goes at the start of the regex instead of /s etc. at the end. It was also an eye-opener in understanding greediness, (.*) vs (.+?) for instance.
Merci encore Thierry
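
For anyone landing here later, the two points above (where the option goes, and greedy vs lazy) can be seen in a small sketch. The handler name, the variable names and the two-paragraph sample string are all made up for illustration; this is not Thierry's code, just a minimal test to run from the message box.

Code: Select all

command demoRegexOptions
   local tText, tGreedy, tLazy, tOneLine, tUnused, tTrailingMatched
   -- sample data invented for this sketch: two paragraphs on separate lines
   put "<p>one</p>" & cr & "<p>two</p>" into tText

   -- greedy: with (?s) the dot crosses the line break and (.*) runs on to the LAST </p>
   get matchText(tText, "(?s)<p>(.*)</p>", tGreedy)

   -- lazy: (.*?) gives up as early as possible, so it stops at the FIRST </p>
   get matchText(tText, "(?s)<p>(.*?)</p>", tLazy)

   -- without (?s) the dot does not match a line break, so even the greedy form stays on line 1
   get matchText(tText, "<p>(.*)</p>", tOneLine)

   -- a trailing /s is not read as an option: it is two literal characters that must appear in the text
   put matchText(tText, "<p>(.*)</p>/s", tUnused) into tTrailingMatched

   put "greedy:" && tGreedy & cr & "lazy:" && tLazy & cr & \
         "no (?s):" && tOneLine & cr & "trailing /s matched:" && tTrailingMatched
end demoRegexOptions

The last test is presumably why adding /s broke the otherwise working single-line pattern: the sample text has no literal "/s" after the closing tag, so the match fails outright.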
