Android Code Snippet [B4X] Code to extract the paths of the src attribute in all <img> tags of an HTML document

Code to extract the paths of the src attribute in all <img> tags of an HTML document.

I wrote this routine because I needed a list of all images used in a document.

Beware: this small solution is based on Regex. Parsing HTML documents with Regex is generally a bad idea; the task is more robustly solved with a real parser (XML transformation and analysis).

However, if you need a small, manageable routine for HTML documents whose structure you know reasonably well, this can serve as a starting point.

B4X:
Sub GetHtmlImagesList(HtmlString As String) As List
    Dim ReturnList As List
    ReturnList.Initialize
    If HtmlString.IndexOf("<img ") < 0 Then    ' IndexOf returns -1 when nothing is found; 0 is a valid hit at the very start of the string
        Return ReturnList
    Else
        Dim MatchWholeImgTag As Matcher
        Dim MatchFilename As Matcher
        Dim FoundFilenameString As String  = ""
        Dim ImageTagString As String  = ""
        MatchWholeImgTag = Regex.Matcher("<img[^>]* src=[^>]*>", HtmlString)  ' Find WHOLE IMAGE TAG:    <img src="...">
        Do While MatchWholeImgTag.find()
            ImageTagString = MatchWholeImgTag.Match  ' <img src="img1.png" width="96" height="96" >
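            ' [^>]*? instead of .*? because "." does not match line breaks (Regex.MULTILINE only changes ^ and $), so tags spread over several lines are still handled
            MatchFilename = Regex.Matcher($"<img[^>]*?src="([^"]+)""$, ImageTagString)    ' Find the FILENAME in src (double-quoted values only), based on https://regex101.com/r/eEyf5S/2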
            If MatchFilename.Find Then
                FoundFilenameString = MatchFilename.Group(1)
                ReturnList.add(FoundFilenameString)
            End If
        Loop
        Return ReturnList
    End If
End Sub
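
A minimal usage sketch for the routine above. The Sub name ShowImagePaths and the file names in the sample HTML are made up for illustration; they are not part of the attached project.

```b4x
' Hypothetical caller: logs every src path found in a small sample document.
Sub ShowImagePaths
    ' Sample input (invented file names); the smart string literal allows quotes inside
    Dim Html As String = $"<html><body><img src="img1.png" width="96"><p>Text</p><img alt="logo" src="images/logo.jpg"></body></html>"$
    Dim Images As List = GetHtmlImagesList(Html)
    For Each ImagePath As String In Images
        Log(ImagePath)
    Next
End Sub
```

With this input the log should show img1.png and images/logo.jpg, one per line. Tags that use single quotes or no quotes around the src value are not picked up.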

Note: The code is deliberately verbose to keep it easy to follow; it can certainly be made shorter and faster.
A small test project is attached.
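
As a sketch of the kind of optimization mentioned in the note, the two matching steps can be folded into a single pass. This is only one possible variant, not the version from the attached project; the name GetHtmlImagesListCompact is hypothetical, and the pattern still assumes double-quoted src values and no ">" inside attribute values.

```b4x
' Compact variant (assumption: src values are always in double quotes).
Sub GetHtmlImagesListCompact(HtmlString As String) As List
    Dim ReturnList As List
    ReturnList.Initialize
    ' [^>]*? keeps the match inside one tag and also spans line breaks;
    ' \ssrc requires whitespace before "src", so attributes such as data-src are skipped.
    Dim m As Matcher = Regex.Matcher2($"<img\b[^>]*?\ssrc\s*=\s*"([^"]+)""$, Regex.CASE_INSENSITIVE, HtmlString)
    Do While m.Find
        ReturnList.Add(m.Group(1))
    Loop
    Return ReturnList
End Sub
```

The initial IndexOf check is dropped here because an empty result falls out naturally when the matcher finds nothing. Regex.CASE_INSENSITIVE additionally accepts tags written as <IMG SRC="...">.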

Attachments

  • WebviewTest_ExtractImgSrcAttributes.zip (3.5 KB)