Regex / variable line extraction
Posted: Fri Apr 24, 2015 9:44 pm
I have hit the limits of my regex knowledge. Using online regex testers I can get successful matches for single lines, but I am having problems with greediness and with setting single-line mode. I need to extract some text from the HTML of a WordPress site over which I have no control.
The HTML I have to parse is shown in the first code block below. I need to extract the (variable number of) lines of text in the <p> tags. I have spent a lot of time on regex tutorials and I am still bemused. I am led to believe that the first pattern below should approach success: it identifies the start of the required region by the Overview heading, and then, since .* is greedy, I should instead use .+? to match the plain text, whitespace or newlines, given the /s setting. Using .* in place of .+? doesn't work either. Nor does any other combination I have been able to come up with, except the second pattern below, which works, but only for one line, and it fails if I add the /s setting.
An extract of the text I have to parse is
Code: Select all
<div class="wpb_text_column wpb_content_element ">
<div class="wpb_wrapper">
<h2>Overview</h2>
</div>
</div>
<div class="vc_row wpb_row vc_inner">
<div class="vc_row-fluid clr">
<div class="vc_col-sm-9 wpb_column clr column_container " style="margin-bottom:0px;"><div class="clr " style="padding-bottom:0px;">
<div class="wpb_text_column wpb_content_element ">
<div class="wpb_wrapper">
<p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa.</p>
<p>Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Donec quam felis, ultricies nec, pellentesque eu, pretium quis, sem. Nulla consequat massa quis enim. Donec pede justo, fringilla vel, aliquet nec, vulputate eget, arcu.</p>
<p>In enim justo, rhoncus ut, imperdiet a, venenatis vitae, justo. Nullam dictum felis eu pede mollis pretium. Integer tincidunt. Cras dapibus.</p>
<p>Vivamus elementum semper nisi. Aenean vulputate eleifend tellus. Aenean leo ligula, porttitor eu, consequat vitae, eleifend ac, enim. Aliquam lorem ante, dapibus in, viverra quis, feugiat a, tellus. Phasellus viverra nulla ut metus varius laoreet.</p>
<p>Quisque rutrum. Aenean imperdiet. Etiam ultricies nisi vel augue. Curabitur ullamcorper ultricies nisi. Nam eget dui. Etiam rhoncus. Maecenas tempus, tellus eget condimentum rhoncus, sem quam semper libero, sit amet adipiscing sem neque sed ipsum.</p>
</div>
</div> </div>
</div>
Code: Select all
<h2>Overview<\/h2>.+?<\/div>.+?<\/div>.+?<div class=.+?<div class=.+?<div class=.+?<div class=.+?<div class=.+?<p>(.*)<\/p>/s
Code: Select all
<h2>Overview<\/h2>\s*<\/div>\s*<\/div>\s*<div class=.*\s*<div class=.*\s*<div class=.*\s*<div class=.*\s*<div class=.*\s*<p>(.*)<\/p>
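A minimal sketch of the behaviour described above, assuming Python's re module with re.DOTALL standing in for the /s setting (the actual tool and regex flavour in use may differ). The shortened HTML and every variable name here are made up for illustration. It shows why a greedy (.*) captures too much once the dot can cross newlines, and one possible two-step alternative: first isolate the run of <p> elements after the Overview heading, then pull each paragraph out lazily.

```python
import re

# Shortened, hypothetical stand-in for the page source.
html = """<div class="wpb_wrapper">
<h2>Overview</h2>
</div>
</div>
<div class="wpb_wrapper">
<p>First paragraph.</p>
<p>Second paragraph.</p>
<p>Third paragraph.</p>
</div>
<p>Unrelated footer text.</p>
"""

# Without DOTALL the dot stops at a newline, so (.*) stays on one line
# and only the first paragraph is captured.
one_line = re.search(r"<p>(.*)</p>", html).group(1)

# With DOTALL the greedy (.*) runs to the LAST </p> in the input,
# swallowing the tags and the unrelated paragraph in between.
greedy = re.search(r"<h2>Overview</h2>.+?<p>(.*)</p>", html, re.DOTALL).group(1)

# Step 1: capture only the consecutive <p>...</p> elements after the heading.
block = re.search(
    r"<h2>Overview</h2>.*?((?:<p>.*?</p>\s*)+)", html, re.DOTALL
).group(1)

# Step 2: extract each paragraph's text with a lazy quantifier.
paragraphs = re.findall(r"<p>(.*?)</p>", block, re.DOTALL)

print(paragraphs)
```

The second step copes with any number of paragraphs because re.findall returns every non-overlapping match, rather than relying on a single capture group.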
Where am I going wrong?
Thanks anyone (*coughThierrycough*)