Android Code Snippet [B4X] Change <img src='...' paths in HTML document


Code to inject/replace the path of the src attribute in all <img> tags of an HTML document.

The routine became necessary because absolute path specifications in the data source had to be replaced with more current ones.

Beware. This small solution is based on Regex.

Parsing HTML documents with Regex is generally a bad idea; the task is better solved with a real parser, for example via XML transformation and analysis.

However, if you need a small, manageable routine for HTML documents whose structure you know reasonably well, this can serve as a first approach.
Sub InjectPathIntoHtmlImageTags(HtmlString As String, PathToInject As String) As String
    ' Nothing to do if there is no <img> tag at all (IndexOf returns -1 when not found).
    If HtmlString.IndexOf("<img ") < 0 Then
        Return HtmlString
    Else
        Dim sb As StringBuilder
        sb.Initialize
        Dim OldPath As String = ""
        Dim MatchWholeImgTag As Matcher
        Dim MatchFilename As Matcher
        Dim FoundFilenameString As String  = ""
        Dim ImageTagString As String  = ""
        Dim PositionImgTagStart As Int = 0
        Dim PositionImgTagEnd As Int = 0
        Dim PositionImgTagEndBefore As Int = 0
        ' "{noimage}" is a special value: each image tag is replaced by a placeholder character.
        ' It must be checked before the trailing slash is appended.
        Dim NoImage As Boolean = (PathToInject = "{noimage}")
        If NoImage = False And PathToInject.EndsWith("/") = False Then
            PathToInject = PathToInject & "/"
        End If
        ' B4X Regex quick tester -->
        MatchWholeImgTag = Regex.Matcher("<img[^>]* src=[^>]*>", HtmlString)  ' Find WHOLE IMAGE TAG: <img src="...">
        Do While MatchWholeImgTag.find()
            ImageTagString = MatchWholeImgTag.Match  ' <img src="img1.png" width="96" height="96" >
            PositionImgTagStart = MatchWholeImgTag.GetStart(0)
            PositionImgTagEnd = MatchWholeImgTag.GetEnd(0)
            ' Copy the text between the previous image tag and this one unchanged.
            sb.Append(HtmlString.SubString2(PositionImgTagEndBefore, PositionImgTagStart))
            Dim RXOptions As Int = Regex.MULTILINE
            ' B4X Regex quick tester -->
            MatchFilename = Regex.Matcher2($"<img.*?src="([^"]+)".*?>"$, RXOptions, ImageTagString)    ' Find the FILENAME in src -->
            If MatchFilename.Find Then
                FoundFilenameString = MatchFilename.Group(1)
                If NoImage Then
                    ImageTagString = "⬜"
                Else
                    ' Split the tag at the opening quote of the src value.
                    Dim p1 As Int = ImageTagString.IndexOf($"src="$)
                    Dim p2 As Int = ImageTagString.IndexOf2($"""$, p1 +2)
                    Dim LeftPart As String = ImageTagString.SubString2(0, p1)
                    Dim RightPart As String = ImageTagString.SubString(p2 +1)
                    If FoundFilenameString.Contains("/") Then
                        ' Remove the old path, then put the new one in front of the file name.
                        OldPath = FoundFilenameString.SubString2(0, FoundFilenameString.LastIndexOf("/") +1)
                        ImageTagString = $"${LeftPart}src="${PathToInject}${RightPart.Replace(OldPath,"")}"$
                    Else
                        ImageTagString = $"${LeftPart}src="${PathToInject}${RightPart}"$
                    End If
                End If
            End If
            sb.Append(ImageTagString)
            PositionImgTagEndBefore = PositionImgTagEnd
        Loop
        ' Append whatever follows the last image tag.
        sb.Append(HtmlString.SubString(PositionImgTagEndBefore))
        Return sb.ToString
    End If
End Sub
Note: The performance can certainly be greatly improved by code optimizations. Here the code is a bit inflated for the sake of traceability.
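
As one possible condensation, here is a minimal sketch that does the rewrite in a single pass with the core library's `Regex.Replace`. `InjectPathCompact` is a hypothetical name and not part of the routine above. The sketch assumes that all src values are double-quoted and that the path contains no `$` or `\`, because both characters have a special meaning in the replacement template. It does not cover the `{noimage}` case.

```b4x
' Hypothetical compact variant (a sketch, not the original routine).
' Assumes double-quoted src values and a PathToInject without "$" or "\".
Sub InjectPathCompact(HtmlString As String, PathToInject As String) As String
    If PathToInject.EndsWith("/") = False Then PathToInject = PathToInject & "/"
    ' Group 1: start of the tag up to and including src="
    ' (?:[^"]*/)? : optional old directory part, dropped from the result
    ' Group 2: the bare file name plus the closing quote
    Dim Pattern As String = $"(<img\b[^>]*?\ssrc=")(?:[^"]*/)?([^"]*")"$
    Return Regex.Replace(Pattern, HtmlString, "$1" & PathToInject & "$2")
End Sub
```

Whether this is actually faster would have to be measured; the main gain is less code to maintain.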

Sub Test1
    Dim TestLines As List
    TestLines.Initialize
    TestLines.Add($"yyy zzz"$)
    TestLines.Add($"aa <img src="image1.jpg" alt="1xyz"> bb"$)
    TestLines.Add($"cc <img src="/oooo/pppp/image2.jpg" alt="2xyz"> dd"$)
    TestLines.Add($"ee <img src="image3.jpg" alt="3xyz"> ff <img src="imgage4.jpg" alt="4 xxyyzz"> gg"$)
    TestLines.Add($"hh <img src="/zzz/image5.jpg" alt="5yyyy"> ii jj <img src="/yyy/image6.jpg" alt="6xyz"> kk"$)

    For Each x As String In TestLines
        Log("#- --------- --------- --------- ---------")
        Log("#- Input =" & x)
        Log("#- Result=" & InjectPathIntoHtmlImageTags(x, "/newpath1/subpathxy/"))
    Next
End Sub

#- --------- --------- --------- ---------
#- Input =yyy zzz
#- Result=yyy zzz
#- --------- --------- --------- ---------
#- Input =aa <img src="image1.jpg" alt="1xyz"> bb
#- Result=aa <img src="/newpath1/subpathxy/image1.jpg" alt="1xyz"> bb
#- --------- --------- --------- ---------
#- Input =cc <img src="/oooo/pppp/image2.jpg" alt="2xyz"> dd
#- Result=cc <img src="/newpath1/subpathxy/image2.jpg" alt="2xyz"> dd
#- --------- --------- --------- ---------
#- Input =ee <img src="image3.jpg" alt="3xyz"> ff <img src="imgage4.jpg" alt="4 xxyyzz"> gg
#- Result=ee <img src="/newpath1/subpathxy/image3.jpg" alt="3xyz"> ff <img src="/newpath1/subpathxy/imgage4.jpg" alt="4 xxyyzz"> gg
#- --------- --------- --------- ---------
#- Input =hh <img src="/zzz/image5.jpg" alt="5yyyy"> ii jj <img src="/yyy/image6.jpg" alt="6xyz"> kk
#- Result=hh <img src="/newpath1/subpathxy/image5.jpg" alt="5yyyy"> ii jj <img src="/newpath1/subpathxy/image6.jpg" alt="6xyz"> kk
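
To make the Regex caveat from the beginning more concrete, a few extra inputs can be run through the routine. The sketch below is an assumption-laden illustration: `Test2` and the markup in it are invented for this purpose. Inputs of this kind are outside what the two patterns were written for, so it is worth checking the logged results before relying on the routine for documents that may contain them.

```b4x
' Hypothetical additional test (inputs invented for illustration).
Sub Test2
    Dim EdgeCases As List
    EdgeCases.Initialize
    EdgeCases.Add($"aa <img src='single-quoted.jpg'> bb"$)                  ' single quotes around the src value
    EdgeCases.Add($"cc <IMG SRC="uppercase.jpg"> dd"$)                      ' upper-case tag and attribute names
    EdgeCases.Add($"ee <img alt="a > b" src="gt-in-attribute.jpg"> ff"$)    ' ">" inside an attribute value

    For Each x As String In EdgeCases
        Log("#- --------- --------- --------- ---------")
        Log("#- Input =" & x)
        Log("#- Result=" & InjectPathIntoHtmlImageTags(x, "/newpath1/subpathxy/"))
    Next
End Sub
```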