Hi,
Does anyone have any code to strip image URLs out of an HTML page and put them in an array?
I'm looking to write a small piece of code to download all of the images on a particular (customisable) web page. I can write the code to download them, but I'm struggling with parsing the HTML to extract just the image src attributes.
Getting image urls out of html page
Re: Getting image urls out of html page
Sounds like a job that could be done with regex?
John Gruber (inventor of Markdown and blog writer worth following) has a post on just this: https://daringfireball.net/2009/11/libe ... ching_urls
EDIT: regex pasted here:
\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))
Apparently this picks up every known variation of URLs, including some wild ones I was unaware of. There is some explanation of this in the linked post. For things like this I use https://regex101.com to analyse and test.
It should be possible to adapt to LC as this regex is fully PCRE compatible…
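One caveat: the regex above matches every URL in the text, not only images, so the results would still need filtering. As an alternative to regex, here is a minimal Python sketch (not LC) that walks the HTML with the standard library parser and keeps only the src attribute of img tags. The names ImageSrcCollector and image_urls are made up for illustration, and the optional base_url parameter is an assumption added so that relative paths can be turned into full URLs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url  # hypothetical: page address, used for relative paths
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, so <IMG SRC=...> works too.
        if tag != "img":
            return
        for name, value in attrs:
            if name == "src" and value:
                # Resolve relative paths against the page address, if one was given.
                self.urls.append(urljoin(self.base_url, value))


def image_urls(html, base_url=""):
    """Return a list of image URLs found in the given HTML text."""
    collector = ImageSrcCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.urls
```

Calling image_urls(page_text, "https://example.com/gallery/") would give back a plain list, ready to hand to whatever does the downloading. Images that are loaded by scripts or set in CSS would not be found this way.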
Re: Getting image urls out of html page
Thanks, I guess that can return the first URL, then I can loop through the HTML to find the rest.
Thanks for the idea.
Andy
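For what it's worth, the "find the first match, then loop for the rest" idea can be sketched like this in Python (it would need porting to LC). The pattern here is a simplified one written just for this example, not Gruber's: it only looks at quoted src values inside img tags, and the names IMG_SRC and all_image_srcs are hypothetical.

```python
import re

# Simplified, illustrative pattern: captures the quoted src value of an <img> tag.
# Limitations: unquoted attributes are missed, and an attribute such as
# data-src that appears before src would be picked up instead.
IMG_SRC = re.compile(r"""<img\b[^>]*?\bsrc\s*=\s*["']([^"']+)["']""", re.IGNORECASE)


def all_image_srcs(html):
    """Collect every src value by searching repeatedly from the last match."""
    urls = []
    pos = 0
    while True:
        match = IMG_SRC.search(html, pos)
        if match is None:
            break  # no more matches in the rest of the text
        urls.append(match.group(1))
        pos = match.end()  # continue just after the match we found
    return urls
```

Each pass resumes from the end of the previous match, so the loop stops as soon as the remaining text has no more img tags with a src.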