Getting image urls out of html page

Posted: Sun Sep 17, 2023 9:54 pm
by andyh1234
Hi,

Does anyone have any code to strip the image URLs out of an HTML page and put them in an array?

I'm looking to write a small piece of code to download all of the images on a particular (customisable) web page. I can write the code to download them, but I am struggling with parsing the HTML to extract just the image src attributes.

Re: Getting image urls out of html page

Posted: Sun Sep 17, 2023 10:32 pm
by stam
Sounds like a job that could be done with regex?

John Gruber (inventor of Markdown and blog writer worth following) has a post on just this: https://daringfireball.net/2009/11/libe ... ching_urls

EDIT: regex pasted here:

 \b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))
Apparently this picks up every known variation of URLs, including some wild ones I was unaware of. There is some explanation of this in the linked post. For things like this I use https://regex101.com to analyse and test.

It should be possible to adapt this to LC, as the regex is fully PCRE-compatible…
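
To make the regex idea concrete, here is a rough sketch in Python rather than LC, so treat it as an illustration of the approach and not a drop-in answer. Python's re module has no POSIX [[:punct:]] class, so the ASCII punctuation set is spelled out instead, which is an adaptation of the pattern above. The names find_image_urls and IMAGE_EXTENSIONS, the extension list, and the example.com addresses are all made up for the example.

```python
import re
import string
from urllib.parse import urlparse

# Python's re has no [[:punct:]], so substitute the ASCII punctuation
# characters explicitly (an adaptation of the pattern, not the original).
PUNCT = re.escape(string.punctuation)
URL_RE = re.compile(
    r"\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^" + PUNCT + r"\s]|/)))"
)

# Hypothetical list of extensions treated as images; adjust as needed.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".webp", ".svg")


def find_image_urls(html):
    """Return every URL in the text whose path ends in an image extension."""
    # finditer walks the whole string, so all matches are found in one pass.
    urls = [match.group(1) for match in URL_RE.finditer(html)]
    return [
        url for url in urls
        if urlparse(url).path.lower().endswith(IMAGE_EXTENSIONS)
    ]


page = (
    '<p><img src="https://example.com/a.png" alt="x"> '
    '<a href="http://example.com/page">link</a> '
    '<img src="http://example.com/b.JPG"></p>'
)
print(find_image_urls(page))
```

Two caveats with this approach: the pattern only finds absolute URLs, so a relative src such as images/a.png is missed, and it matches URLs anywhere in the text, not only inside img tags.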

Re: Getting image urls out of html page

Posted: Mon Sep 18, 2023 1:34 pm
by andyh1234
Thanks. I guess that will return the first URL, and then I can loop through the HTML to find the rest.

Thanks for the idea.

Andy
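
For the looping step, an alternative to repeated regex matching is to let an HTML parser walk the page and collect the src attribute of every img tag. The sketch below is in Python (not LC) and uses only the standard library. The class name ImageSrcCollector, the function collect_image_urls, the base_url parameter, and the example addresses are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag fed to the parser."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "img":
            return
        for name, value in attrs:
            if name == "src" and value:
                # Resolve relative paths against the page address.
                self.urls.append(urljoin(self.base_url, value))


def collect_image_urls(html, base_url=""):
    """Return a list of image URLs found in the HTML, in document order."""
    parser = ImageSrcCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.urls


page = (
    '<html><body><img src="pics/a.png"><IMG SRC="/b.jpg" />'
    '<a href="page.html">x</a><img alt="no src">'
    '<img src="https://cdn.example.org/c.gif"></body></html>'
)
print(collect_image_urls(page, "https://example.com/gallery/"))
```

Unlike the URL regex, this picks up relative src values and ignores links that are not images, at the cost of missing images referenced from CSS or scripts.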