Code to extract the paths in the src attribute of all <img> tags in an HTML document.
A small test project is attached.
The routine became necessary because a list of all used images was needed.
Beware: this small solution is based on Regex. Parsing HTML documents with Regex is generally a bad idea; the task is better solved by parsing and analyzing the document as XML.
However, if you need a small, manageable routine for HTML documents whose structure you know reasonably well, this can serve as a first approach.
B4X:
Sub GetHtmlImagesList(HtmlString As String) As List
    Dim ReturnList As List
    ReturnList.Initialize
    ' IndexOf returns -1 if nothing is found; 0 is a valid hit at the very start of the string
    If HtmlString.IndexOf("<img ") < 0 Then
        Return ReturnList
    Else
        Dim MatchWholeImgTag As Matcher
        Dim MatchFilename As Matcher
        Dim FoundFilenameString As String = ""
        Dim ImageTagString As String = ""
        ' Step 1: find each WHOLE IMAGE TAG that has a src attribute: <img src="...">
        MatchWholeImgTag = Regex.Matcher("<img[^>]* src=[^>]*>", HtmlString)
        Do While MatchWholeImgTag.Find
            ImageTagString = MatchWholeImgTag.Match ' <img src="img1.png" width="96" height="96" >
            Dim RXOptions As Int = Regex.MULTILINE
            ' Step 2: find the FILENAME in src (double-quoted values only) --> https://regex101.com/r/eEyf5S/2
            MatchFilename = Regex.Matcher2($"<img.*?src="([^"]+)".*?>"$, RXOptions, ImageTagString)
            If MatchFilename.Find Then
                FoundFilenameString = MatchFilename.Group(1)
                ReturnList.Add(FoundFilenameString)
            End If
        Loop
        Return ReturnList
    End If
End Sub
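A short usage sketch may help here. It assumes the Sub above is in the same module; the HTML string and the variable names are made up for illustration only.

```b4x
' Illustrative test data, not taken from the attached project
Dim Html As String = $"<html><body><img src="img1.png" width="96" height="96" ><p>Text</p><img alt="logo" src="pics/logo.jpg"></body></html>"$

Dim Images As List = GetHtmlImagesList(Html)
For Each ImagePath As String In Images
    Log(ImagePath) ' logs img1.png, then pics/logo.jpg
Next
```

The list holds the src values in the order in which the tags appear in the document. Tags written in upper case (<IMG ...>) or with a single-quoted src are not found by the patterns used above.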
Note: The performance can certainly be improved considerably by optimizing the code. The code is deliberately a bit verbose here so that each step is easy to follow.
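As one possible optimization, both steps can be folded into a single pattern, so that only one Matcher runs over the document. This is only a sketch under assumptions: the Sub name GetHtmlImagesListCompact and the pattern are hypothetical and not part of the attached project. The pattern additionally accepts single-quoted src values and ignores upper/lower case.

```b4x
' Hypothetical single-pass variant; Sub name and pattern are illustrative assumptions
Sub GetHtmlImagesListCompact(HtmlString As String) As List
    Dim ReturnList As List
    ReturnList.Initialize
    ' One pattern: locate each <img ...> tag and capture the quoted src value in group 1.
    ' \ssrc requires whitespace before "src", so attributes such as data-src are skipped.
    Dim Pattern As String = $"<img\b[^>]*?\ssrc\s*=\s*["']([^"']+)["']"$
    Dim M As Matcher = Regex.Matcher2(Pattern, Regex.CASE_INSENSITIVE, HtmlString)
    Do While M.Find
        ReturnList.Add(M.Group(1))
    Loop
    Return ReturnList
End Sub
```

The same caveat about Regex and HTML applies: unquoted src values, or paths that themselves contain a quote character, are not handled correctly by this sketch.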