Android Code Snippet [B4X] Code to extract the paths of the src attribute in all <img> tags of an HTML document

Discussion in 'Code Snippets' started by fredo, May 22, 2019.

  1. fredo

    fredo Active Member Licensed User

    Code to extract the paths of the src attribute in all <img> tags of an HTML document.

    The routine became necessary because a list of all used images was needed.

    Beware: this small solution is based on Regex. Parsing HTML documents with Regex is generally a bad idea; the task is better solved via XML transformation and analysis.

    However, if you need a small, manageable routine for HTML documents whose structure you know reasonably well, this can serve as a first approach.

    Code:
    Sub GetHtmlImagesList(HtmlString As String) As List
        Dim ReturnList As List
        ReturnList.Initialize
        ' IndexOf returns -1 when nothing is found; index 0 is a valid hit
        If HtmlString.IndexOf("<img ") < 0 Then
            Return ReturnList
        Else
            Dim MatchWholeImgTag As Matcher
            Dim MatchFilename As Matcher
            Dim FoundFilenameString As String = ""
            Dim ImageTagString As String = ""
            MatchWholeImgTag = Regex.Matcher("<img[^>]* src=[^>]*>", HtmlString)  ' Find WHOLE IMAGE TAG:    <img src="...">
            Do While MatchWholeImgTag.Find
                ImageTagString = MatchWholeImgTag.Match  ' <img src="img1.png" width="96" height="96" >
                Dim RXOptions As Int = Regex.MULTILINE
                MatchFilename = Regex.Matcher2($"<img.*?src="([^"]+)".*?>"$, RXOptions, ImageTagString)    ' Find the FILENAME in src --> https://regex101.com/r/eEyf5S/2
                If MatchFilename.Find Then
                    FoundFilenameString = MatchFilename.Group(1)
                    ReturnList.Add(FoundFilenameString)
                End If
            Loop
            Return ReturnList
        End If
    End Sub
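
    A possible way to call the routine, as a minimal sketch. The Sub name DemoGetHtmlImagesList and the file names in the sample HTML are made up for illustration and are not part of the attached project.

    ```b4x
    ' Hypothetical demo Sub; the sample HTML and its file names are invented.
    Sub DemoGetHtmlImagesList
        Dim Html As String = $"<html><body>
    <img src="img1.png" width="96" height="96" >
    <p>Some text</p>
    <img class="logo" src="images/logo.jpg">
    <img src='single-quoted.png'>
    </body></html>"$
        Dim Images As List = GetHtmlImagesList(Html)
        ' Write every extracted src path to the log
        For Each ImagePath As String In Images
            Log(ImagePath)
        Next
    End Sub
    ```

    With this input the log should show img1.png and images/logo.jpg. The third tag is skipped because the inner pattern only accepts a src value in double quotes, which is one example of the Regex limitations mentioned above.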
    Note: The code is deliberately a bit verbose for the sake of traceability. Performance can certainly be improved considerably by optimizing it.
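
    As one possible optimization, the two Regex steps could be merged into a single pass with one capture group. This is only a sketch under assumptions: the name GetHtmlImagesListSinglePass is hypothetical, the pattern is a simplification, and it is not guaranteed to return exactly the same results as the routine above in every edge case (for example, it requires a space directly before src, so an attribute such as data-src is not picked up).

    ```b4x
    ' Hypothetical single-pass variant; not part of the original snippet.
    Sub GetHtmlImagesListSinglePass(HtmlString As String) As List
        Dim ReturnList As List
        ReturnList.Initialize
        ' Group 1 captures the double-quoted src value of each <img> tag
        Dim TagMatcher As Matcher = Regex.Matcher($"<img[^>]*? src="([^"]+)""$, HtmlString)
        Do While TagMatcher.Find
            ReturnList.Add(TagMatcher.Group(1))
        Loop
        Return ReturnList
    End Sub
    ```

    The early IndexOf check is dropped here because the loop simply does not run when there is no match, and the list is returned empty.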
    A small test project is attached.
    2019-05-22_14-38-28.jpg
     
