If I read the intro docs at the Google Code page for Boilerpipe correctly, it appears to be for extracting text sans formatting to hand off to full-text indexers, and it does not attempt semantic analysis of portions of the extracted text.
Please point me to what I missed if I misread that. But if I didn't, it seems we have two separate challenges here:
1. Stripping HTML/CSS/JS from the contents of a web page.
2. Identifying specific content elements by semantic role (title, summary, and article were specified as desired elements a few posts back).
#1 is not hard, but should be done after #2, so we can take advantage of HTML tags to identify content element boundaries. Once a content element is extracted, setting the htmlText of the templateField to that HTML and then retrieving the text of the templateField will strip tags and return plain text (which MaxV had suggested here early on).
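For anyone following along outside LiveCode, here is a rough Python sketch of that tag-stripping step. It is a stand-in for the templateField trick, not the same mechanism: it uses the standard library's html.parser, and the names TextOnly and strip_tags are made up for this example. It assumes you have already isolated the content element as an HTML fragment.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects text nodes; drops tags and anything inside <script>/<style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turns &amp; etc. into characters
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def strip_tags(fragment):
    """Return the plain text of an HTML fragment with whitespace collapsed."""
    parser = TextOnly()
    parser.feed(fragment)
    parser.close()
    return " ".join("".join(parser.parts).split())
```

Calling strip_tags on an extracted element gives you the plain text to hand to an indexer. Whitespace is collapsed to single spaces, which may or may not be what you want for long articles.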
#2 is much harder, because there are varying conventions for semantic markup, and they are not consistently used, if they're used at all.
A page's title is easy enough: the offset function (or, if you prefer regex, the matchText function) can be used to find the opening and closing title tags and obtain the string between them.
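To make the two approaches concrete, here is a small Python sketch rather than LiveCode: str.find plays the role of offset, and the re module plays the role of matchText. The function names are hypothetical, and both versions simply take the first title element they find.

```python
import re
import string

# Lowercase ASCII letters only, so character positions in the lowered copy
# stay aligned with positions in the original string.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def title_by_offset(html):
    """Find the title by locating the opening and closing tags by position."""
    lower = html.translate(_ASCII_LOWER)
    start = lower.find("<title")
    if start == -1:
        return ""
    tag_close = lower.find(">", start)  # end of the opening tag
    if tag_close == -1:
        return ""
    end = lower.find("</title", tag_close)
    if end == -1:
        return ""
    return html[tag_close + 1:end].strip()


# DOTALL lets the title span several lines; IGNORECASE accepts <TITLE>.
_TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)


def title_by_regex(html):
    """Find the title with a single regular expression."""
    match = _TITLE_RE.search(html)
    return match.group(1).strip() if match else ""
```

Both return an empty string when no title is found. Neither decodes character entities, so a title containing something like &amp; comes back as written in the source.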
"Summary" may be more difficult, unless the metatag often used for "description" will suffice, in which case it's no more difficult than extracting the title, done by pretty much the same means.
If you attempt to write a single routine which can extract an "article" from any web page, you will go insane, and as you eke out your remaining existence in an asylum you will never find a satisfying solution to your dying day.
I wish I had better news, but there are no commonly-used conventions for identifying the boundaries between "article content" and navigational elements, pull quotes, header, footer, etc. HTML5 offers some nice assistance with its semantic tags, but even those are still flexible enough that it can be quite difficult to find a finite set of patterns for isolating "article" content. And those tags exist in only a minority of pages, so for most of the web you are, in a word, hosed.
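For the minority of pages that do use the HTML5 tags, a sketch along these lines may be a starting point. It is Python using the standard html.parser, the class and function names are invented for illustration, and the set of tags to skip is just my guess at what usually isn't article text. It only looks at the first article element and returns nothing for pages that don't have one.

```python
from html.parser import HTMLParser


class ArticleText(HTMLParser):
    """Collects text inside the first <article>, skipping common non-article parts."""

    # Assumption: these elements rarely hold article body text.
    SKIP = {"script", "style", "nav", "aside", "footer"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.article_depth = 0  # greater than 0 while inside <article>
        self.skip_depth = 0     # greater than 0 while inside a skipped element
        self.done = False       # set once the first article has closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == "article":
            self.article_depth += 1
        elif self.article_depth and tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if self.done:
            return
        if tag == "article" and self.article_depth:
            self.article_depth -= 1
            if self.article_depth == 0:
                self.done = True
        elif tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.article_depth and not self.skip_depth and not self.done:
            self.parts.append(data)


def extract_article_text(html):
    """Return text of the first <article> element, or "" if the page has none."""
    parser = ArticleText()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())
```

An empty result is the common case, not the exception, so treat this as one pattern among the several you would need, not a general solution.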
If you have a specific collection of pages in mind for your mining, you may be able to identify common patterns among them. If so, the guidance about offset and matchText above could be applied for this task as well.
But if you find a dizzying variance of tag patterns among web pages, the only consolation I can offer is that you're not alone: the web is full of articles about scraping strategies. I found this book helpful, but before you purchase anything, try a few web searches for free info. There's plenty out there.
Unless you have the good fortune of working with a limited variety of patterns among the pages you need to parse, you may find an imperfect solution in first removing the things you know you don't need and hand-editing the remainder to clean up anything you may have missed.
Even better would be if the sources you're using also publish complete content in RSS format. RSS is horribly abused and so fragmented today it barely qualifies as a single thing (don't even get me started about the differences between "Really Simple Syndication" and "RDF Site Summary" - argh!!!!). But for all the abuse and misunderstood conventions, it's less hair-pulling than the vast wild jungle of varying HTML usage patterns.
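To show why a feed is less painful, here is a minimal Python sketch that pulls title, summary and article out of an RSS 2.0 feed with the standard xml.etree.ElementTree module. The function name and the dictionary keys are my own. It assumes the feed puts the full text in content:encoded, which many feeds do not, and it will find no items in an RDF-style RSS 1.0 feed because those item elements are namespaced.

```python
import xml.etree.ElementTree as ET

# Namespace commonly used for the content:encoded element.
CONTENT_NS = "{http://purl.org/rss/1.0/modules/content/}"


def read_rss_items(xml_text):
    """Return a list of dicts with title, summary and article for each item."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):  # matches un-namespaced RSS 2.0 items only
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "summary": (item.findtext("description") or "").strip(),
            # Full content, if the publisher includes it; otherwise empty.
            "article": (item.findtext(CONTENT_NS + "encoded") or "").strip(),
        })
    return items
```

The article value is usually still HTML, so it would need the same tag stripping as any other extracted element. When it comes back empty, the summary is all the feed offers and you are back to parsing the page itself.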