Android Code Snippet [B4X] Change <img src='...' paths in HTML document

Discussion in 'Code Snippets' started by fredo, Apr 30, 2019.

  1. fredo

    fredo Active Member Licensed User

    Code to inject or replace the path in the src attribute of every <img> tag in an HTML document.

    The routine became necessary because absolute path specifications in the data source had to be replaced with more current ones.

    Beware: this small solution is based on Regex.

    Parsing HTML documents with Regex is generally a bad idea; a real HTML/XML parser or transformation is the more robust tool.

    However, if you need a small, manageable routine for HTML documents whose structure you know reasonably well, this can serve as a starting point. Note that the patterns below only handle src values in double quotes.
    Code:
    Sub InjectPathIntoHtmlImageTags(HtmlString As String, PathToInject As String) As String
        ' Nothing to do if there is no <img> tag at all.
        ' (IndexOf returns -1 when not found; index 0 is a valid hit.)
        If HtmlString.IndexOf("<img ") < 0 Then
            Return HtmlString
        Else
            Dim sb As StringBuilder
            sb.Initialize
            Dim OldPath As String = ""
            Dim MatchWholeImgTag As Matcher
            Dim MatchFilename As Matcher
            Dim FoundFilenameString As String = ""
            Dim ImageTagString As String = ""
            Dim PositionImgTagStart As Int = 0
            Dim PositionImgTagEnd As Int = 0
            Dim PositionImgTagEndBefore As Int = 0

            ' Make sure the new path ends with a slash.
            ' The special value "{noimage}" is left untouched so the check further down can match.
            If PathToInject <> "{noimage}" And Not(PathToInject.EndsWith("/")) Then
                PathToInject = PathToInject & "/"
            End If

            ' B4X Regex quick tester --> https://b4x.com:51041/regex_ws/index.html
            MatchWholeImgTag = Regex.Matcher("<img[^>]* src=[^>]*>", HtmlString)  ' Find WHOLE IMAGE TAG: <img src="...">
            Do While MatchWholeImgTag.Find
                ImageTagString = MatchWholeImgTag.Match  ' <img src="img1.png" width="96" height="96" >
                PositionImgTagStart = MatchWholeImgTag.GetStart(0)
                PositionImgTagEnd = MatchWholeImgTag.GetEnd(0)

                ' Copy the text between the previous tag and this one unchanged.
                sb.Append(HtmlString.SubString2(PositionImgTagEndBefore, PositionImgTagStart))

                Dim RXOptions As Int = Regex.MULTILINE

                ' Find the FILENAME in src --> https://regex101.com/r/eEyf5S/2
                MatchFilename = Regex.Matcher2($"<img.*?src="([^"]+)".*?>"$, RXOptions, ImageTagString)
                If MatchFilename.Find Then
                    FoundFilenameString = MatchFilename.Group(1)
                    If PathToInject = "{noimage}" Then
                        ' Replace the whole tag with a placeholder character.
                        ImageTagString = "⬜"
                    Else
                        ' p1 = start of src=, p2 = opening quote of the src value
                        Dim p1 As Int = ImageTagString.IndexOf($"src="$)
                        Dim p2 As Int = ImageTagString.IndexOf2($"""$, p1 + 2)
                        Dim LeftPart As String = ImageTagString.SubString2(0, p1)
                        Dim RightPart As String = ImageTagString.SubString(p2 + 1)
                        If FoundFilenameString.Contains("/") Then
                            ' Strip the old path (everything up to the last slash) before injecting the new one.
                            OldPath = FoundFilenameString.SubString2(0, FoundFilenameString.LastIndexOf("/") + 1)
                            ImageTagString = $"${LeftPart}src="${PathToInject}${RightPart.Replace(OldPath, "")}"$
                        Else
                            ImageTagString = $"${LeftPart}src="${PathToInject}${RightPart}"$
                        End If
                    End If
                End If

                sb.Append(ImageTagString)
                PositionImgTagEndBefore = PositionImgTagEnd
            Loop

            ' Append whatever follows the last <img> tag.
            sb.Append(HtmlString.SubString(PositionImgTagEndBefore))
            Return sb.ToString
        End If
    End Sub
    Note: Performance can certainly be improved considerably by optimizing the code. It is deliberately verbose here to keep it easy to follow.
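
    For comparison, a more compact variant could do the whole job in one pass with Regex.Replace. This is only a sketch under a few assumptions: the sub name InjectPathCompact is made up for illustration, src values are double-quoted, the "{noimage}" case is not handled, and PathToInject contains no "$" or "\" characters (they have a special meaning in the replacement template on B4A/B4J). Unlike the routine above, it only touches the src value and never the rest of the tag.

    ```b4x
    ' Hypothetical compact variant, not the routine from this post.
    Sub InjectPathCompact(HtmlString As String, PathToInject As String) As String
        If PathToInject.EndsWith("/") = False Then PathToInject = PathToInject & "/"
        ' Group 1: the tag up to and including the opening quote of src.
        ' Non-capturing group: the old path up to the last slash (dropped).
        ' Group 2: the bare file name.
        Dim Pattern As String = $"(<img\b[^>]*?\ssrc=")(?:[^">]*/)?([^"/>]+)"$
        Return Regex.Replace(Pattern, HtmlString, "$1" & PathToInject & "$2")
    End Sub
    ```

    It can be called the same way as the long version, e.g. InjectPathCompact(x, "/newpath1/subpathxy/") inside the test loop.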

    Test:
    Code:
    Sub Test1
        Dim TestLines As List
        TestLines.Initialize
        TestLines.Add($"yyy zzz"$)
        TestLines.Add($"aa <img src="image1.jpg" alt="1xyz"> bb"$)
        TestLines.Add($"cc <img src="/oooo/pppp/image2.jpg" alt="2xyz"> dd"$)
        TestLines.Add($"ee <img src="image3.jpg" alt="3xyz"> ff <img src="imgage4.jpg" alt="4 xxyyzz"> gg"$)
        TestLines.Add($"hh <img src="/zzz/image5.jpg" alt="5yyyy"> ii jj <img src="/yyy/image6.jpg" alt="6xyz"> kk"$)

        For Each x As String In TestLines
            Log("#-")
            Log("#- --------- --------- --------- ---------")
            Log("#- Input =" & x)
            Log("#- Result=" & InjectPathIntoHtmlImageTags(x, "/newpath1/subpathxy/"))
        Next
    End Sub

  2. zhonghua

    zhonghua Member Licensed User

    Nice code!
     