Getting image urls out of html page

andyh1234
Posts: 442
Joined: Mon Aug 13, 2007 4:44 pm
Location: Eccles UK

Post by andyh1234 » Sun Sep 17, 2023 9:54 pm

Hi,

Does anyone have any code to strip image URLs out of an HTML page and put them in an array?

I'm looking to write a small piece of code that downloads all of the images on a particular (customisable) web page. I can write the code to download them, but I'm struggling with parsing the HTML to extract just the src attributes of the img tags.

stam
Posts: 2686
Joined: Sun Jun 04, 2006 9:39 pm
Location: London, UK

Re: Getting image urls out of html page

Post by stam » Sun Sep 17, 2023 10:32 pm

Sounds like a job that could be done with regex?

John Gruber (inventor of Markdown and a blog writer worth following) has a post on matching URLs with a regex: https://daringfireball.net/2009/11/libe ... ching_urls

EDIT: regex pasted here:

 \b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))
Apparently this picks up nearly every variation of URL, including some wild ones I was unaware of. The linked post explains some of it; for things like this I use https://regex101.com to analyse and test.

It should be possible to adapt this to LiveCode (LC), as the regex is fully PCRE compatible…
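One caveat: the pattern above matches any URL that starts with a scheme or www, so it would also pick up links and scripts, and it would miss relative image paths such as images/a.png. A narrower pattern aimed only at the src attribute of img tags may be a better fit. Here is a minimal sketch in Python, purely to illustrate the idea (it is not LiveCode, and the IMG_SRC pattern and image_urls function are hypothetical names made up for this example). It assumes the src value is quoted, which real-world HTML does not guarantee.

```python
import re

# Hypothetical pattern (an assumption, not from the linked post):
# an <img tag, any attributes, then whitespace + src = "value" or 'value'.
# Requiring whitespace before src avoids matching attributes like data-src.
IMG_SRC = re.compile(
    r"""<img\b[^>]*?\ssrc\s*=\s*(["'])(.*?)\1""",
    re.IGNORECASE | re.DOTALL,
)

def image_urls(html):
    """Return the src value of every <img> tag, in document order."""
    return [m.group(2) for m in IMG_SRC.finditer(html)]

sample = (
    '<p><img src="a.png" alt="A">'
    '<a href="x.html">x</a>'
    "<IMG class=\"c\" SRC='/img/b.jpg'></p>"
)
print(image_urls(sample))  # ['a.png', '/img/b.jpg']
```

The values come back exactly as written in the page, so relative paths would still need to be combined with the page's address before downloading.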

andyh1234
Posts: 442
Joined: Mon Aug 13, 2007 4:44 pm
Location: Eccles UK
Contact:

Re: Getting image urls out of html page

Post by andyh1234 » Mon Sep 18, 2023 1:34 pm

Thanks. I guess that will return the first URL, and then I can loop through the HTML to find the rest.
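The loop could look something like the sketch below, written in Python only to show the logic (it is not LiveCode). The IMG_SRC pattern and the collect_image_urls name are assumptions for illustration: the pattern targets quoted src attributes of img tags rather than URLs in general. Each pass searches from where the previous match ended and stores the result under a numeric key, similar to a numerically keyed array.

```python
import re

# Hypothetical pattern (an assumption): quoted src attribute of an <img> tag.
IMG_SRC = re.compile(
    r"""<img\b[^>]*?\ssrc\s*=\s*(["'])(.*?)\1""",
    re.IGNORECASE | re.DOTALL,
)

def collect_image_urls(html):
    """Find the first match, then keep searching after it until none remain."""
    urls = {}   # keys 1, 2, 3, ... in the order the images appear
    pos = 0     # where the next search starts
    while True:
        match = IMG_SRC.search(html, pos)
        if match is None:
            break
        urls[len(urls) + 1] = match.group(2)
        pos = match.end()  # skip past this tag so it is not found again
    return urls

page = '<img src="one.png"><p>text</p><img alt="x" src="two.jpg">'
print(collect_image_urls(page))  # {1: 'one.png', 2: 'two.jpg'}
```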

Thanks for the idea.

Andy
