Android Code Snippet [B4X] Change <img src='...' paths in HTML document

fredo

Code to inject/replace the path in the src attribute of all <img> tags in an HTML document.

The routine became necessary because absolute paths in the data source had to be replaced with current ones.

Beware: this small solution is based on Regex.

Parsing HTML documents with Regex is generally a bad idea; the task is better solved via XML transformation and analysis, i.e. with a real parser.

However, if you need a small, manageable routine for HTML documents whose structure you know reasonably well, this can serve as a starting point.
B4X:
Sub InjectPathIntoHtmlImageTags(HtmlString As String, PathToInject As String) As String
    If HtmlString.IndexOf("<img ") < 0 Then  ' IndexOf returns -1 when nothing is found; 0 is a valid hit at the very start
        Return HtmlString
    Else
        Dim sb As StringBuilder
        sb.Initialize
        Dim OldPath As String = ""
        Dim MatchWholeImgTag As Matcher
        Dim MatchFilename As Matcher
        Dim FoundFilenameString As String  = ""
        Dim ImageTagString As String  = ""
        Dim PositionImgTagStart As Int = 0
        Dim PositionImgTagEnd As Int = 0
        Dim PositionImgTagEndBefore As Int = 0
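        ' Leave the "{noimage}" marker untouched, otherwise the check inside the loop can never match
        If PathToInject <> "{noimage}" And Not(PathToInject.EndsWith("/")) Then
            PathToInject = PathToInject & "/"
        End If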
        
        ' B4X Regex quick tester --> https://b4x.com:51041/regex_ws/index.html
        MatchWholeImgTag = Regex.Matcher("<img[^>]* src=[^>]*>", HtmlString)  ' Find WHOLE IMAGE TAG: <img src="...">
        Do While MatchWholeImgTag.find()
            ImageTagString = MatchWholeImgTag.Match  ' <img src="img1.png" width="96" height="96" >
            PositionImgTagStart = MatchWholeImgTag.GetStart(0)
            PositionImgTagEnd = MatchWholeImgTag.GetEnd(0)
            sb.Append(HtmlString.SubString2(PositionImgTagEndBefore, PositionImgTagStart))
            Dim RXOptions As Int = Regex.MULTILINE
        
            ' B4X Regex quick tester --> https://b4x.com:51041/regex_ws/index.html
            MatchFilename = Regex.Matcher2($"<img.*?src="([^"]+)".*?>"$, RXOptions, ImageTagString)    ' Find the FILENAME in src --> https://regex101.com/r/eEyf5S/2
            If MatchFilename.Find Then
                FoundFilenameString = MatchFilename.Group(1)
                If PathToInject = "{noimage}" Then
                    ImageTagString = "⬜"
                Else
                    Dim p1 As Int = ImageTagString.IndexOf($"src="$)
                    Dim p2 As Int = ImageTagString.IndexOf2($"""$, p1 +2)
                    Dim LeftPart As String = ImageTagString.SubString2(0, p1)
                    Dim RightPart As String = ImageTagString.SubString(p2 +1)
                    If FoundFilenameString.Contains("/") Then
                        OldPath = FoundFilenameString.SubString2(0, FoundFilenameString.LastIndexOf("/") +1)
                        ImageTagString = $"${LeftPart}src="${PathToInject}${RightPart.Replace(OldPath,"")}"$
                    Else
                        ImageTagString = $"${LeftPart}src="${PathToInject}${RightPart}"$
                    End If
                End If
            End If
            
            sb.Append(ImageTagString)
            PositionImgTagEndBefore = PositionImgTagEnd
        Loop
        sb.Append(HtmlString.SubString(PositionImgTagEndBefore))
        Return sb.ToString
    End If
End Sub
Note: The code is deliberately verbose for the sake of traceability. Performance can certainly be improved by optimizing it.
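
As an illustration of what a leaner version might look like, here is a minimal sketch that does the replacement in a single pass with one pattern and two capture groups. It is not the routine above: the name InjectPathCompact is made up for this example, the "{noimage}" mode is left out, the case-insensitive matching is an added assumption, and it still only handles double-quoted src values.
B4X:
Sub InjectPathCompact(HtmlString As String, PathToInject As String) As String
    If Not(PathToInject.EndsWith("/")) Then PathToInject = PathToInject & "/"
    Dim sb As StringBuilder
    sb.Initialize
    Dim LastEnd As Int = 0
    ' Group 1: from "<img" up to and including src="   Group 2: the src value itself
    Dim m As Matcher = Regex.Matcher2($"(<img\b[^>]*?\ssrc=")([^"]*)"$, Regex.CASE_INSENSITIVE, HtmlString)
    Do While m.Find
        Dim Src As String = m.Group(2)
        ' Keep only the file name; LastIndexOf returns -1 if there is no "/", so +1 gives 0
        Dim FileName As String = Src.SubString(Src.LastIndexOf("/") + 1)
        ' Copy everything up to the start of the src value, then write the new path and file name
        sb.Append(HtmlString.SubString2(LastEnd, m.GetStart(2)))
        sb.Append(PathToInject).Append(FileName)
        LastEnd = m.GetEnd(2)
    Loop
    ' Copy the rest of the document after the last src value
    sb.Append(HtmlString.SubString(LastEnd))
    Return sb.ToString
End Sub
Because only the span of group 2 is swapped out, the other attributes of the tag are never touched. Requiring whitespace before src also keeps attributes such as data-src from being picked up.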

Test:
B4X:
Sub Test1
    Dim TestLines As List
    TestLines.Initialize
    TestLines.Add($"yyy zzz"$)
    TestLines.Add($"aa <img src="image1.jpg" alt="1xyz"> bb"$)
    TestLines.Add($"cc <img src="/oooo/pppp/image2.jpg" alt="2xyz"> dd"$)
    TestLines.Add($"ee <img src="image3.jpg" alt="3xyz"> ff <img src="imgage4.jpg" alt="4 xxyyzz"> gg"$)
    TestLines.Add($"hh <img src="/zzz/image5.jpg" alt="5yyyy"> ii jj <img src="/yyy/image6.jpg" alt="6xyz"> kk"$)
    

    For Each x As String In TestLines
        Log("#-")
        Log("#- --------- --------- --------- ---------")
        Log("#- Input =" & x)
        Log("#- Result=" & InjectPathIntoHtmlImageTags(x, "/newpath1/subpathxy/"))
    Next
End Sub

Output:
#-
#- --------- --------- --------- ---------
#- Input =yyy zzz
#- Result=yyy zzz
#-
#- --------- --------- --------- ---------
#- Input =aa <img src="image1.jpg" alt="1xyz"> bb
#- Result=aa <img src="/newpath1/subpathxy/image1.jpg" alt="1xyz"> bb
#-
#- --------- --------- --------- ---------
#- Input =cc <img src="/oooo/pppp/image2.jpg" alt="2xyz"> dd
#- Result=cc <img src="/newpath1/subpathxy/image2.jpg" alt="2xyz"> dd
#-
#- --------- --------- --------- ---------
#- Input =ee <img src="image3.jpg" alt="3xyz"> ff <img src="imgage4.jpg" alt="4 xxyyzz"> gg
#- Result=ee <img src="/newpath1/subpathxy/image3.jpg" alt="3xyz"> ff <img src="/newpath1/subpathxy/imgage4.jpg" alt="4 xxyyzz"> gg
#-
#- --------- --------- --------- ---------
#- Input =hh <img src="/zzz/image5.jpg" alt="5yyyy"> ii jj <img src="/yyy/image6.jpg" alt="6xyz"> kk
#- Result=hh <img src="/newpath1/subpathxy/image5.jpg" alt="5yyyy"> ii jj <img src="/newpath1/subpathxy/image6.jpg" alt="6xyz"> kk
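
To make the Regex caveat concrete, here is a small additional test sketch with two inputs that the patterns above do not cover: a single-quoted src value and an upper-case tag. Test2 and the sample file names are made up for this example. From reading the code, both lines should come back unchanged, since the search for "<img " is case-sensitive and the filename pattern only looks for double quotes.
B4X:
Sub Test2
    Dim EdgeCases As List
    EdgeCases.Initialize
    EdgeCases.Add($"aa <img src='image7.jpg' alt="7xyz"> bb"$)   ' single-quoted src value
    EdgeCases.Add($"cc <IMG SRC="image8.jpg" ALT="8xyz"> dd"$)   ' upper-case tag and attribute names

    For Each x As String In EdgeCases
        Dim Result As String = InjectPathIntoHtmlImageTags(x, "/newpath1/subpathxy/")
        Dim Unchanged As Boolean = (Result = x)
        Log("#- Input    =" & x)
        Log("#- Unchanged=" & Unchanged)
    Next
End Sub
If your documents contain such variants, the patterns need to be extended or the HTML should go through a real parser instead.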
 