Code to inject/replace the path of the src attribute in all <img> tags of an HTML document.
The routine became necessary because absolute path specifications in the data source had to be replaced with more current ones.
Beware. This small solution is based on Regex.
Parsing HTML documents with Regex is basically a bad idea and is better solved via XML transformation and analysis.
However, if you need a small, manageable routine for relatively well-known HTML documents, you can take a first approach here.
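To make the warning concrete, here is a small sketch that runs the same src pattern used in the routine against a few tag variants. The Sub name ShowRegexLimits and the sample tags are made up for illustration and are not part of the routine itself. Only the first variant is matched, so documents containing the other forms would pass through unchanged.

```b4x
' Hypothetical helper for illustration only.
Sub ShowRegexLimits
    Dim Samples As List
    Samples.Initialize
    Samples.Add($"<img src="a.png">"$)   ' double quotes: matched
    Samples.Add($"<img src='a.png'>"$)   ' single quotes: not matched
    Samples.Add($"<img src=a.png>"$)     ' unquoted value: not matched
    Samples.Add($"<IMG SRC="a.png">"$)   ' upper case: not matched, the pattern is case sensitive
    For Each Tag As String In Samples
        ' Same pattern as the file name matcher in the routine
        Dim m As Matcher = Regex.Matcher($"<img.*?src="([^"]+)".*?>"$, Tag)
        Log(Tag & " -> " & m.Find)
    Next
End Sub
```

If your documents may contain such variants, that is the point where a real HTML parser becomes the safer choice.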
B4X:
Sub InjectPathIntoHtmlImageTags(HtmlString As String, PathToInject As String) As String
    ' IndexOf returns -1 when there is no match; 0 is a valid hit at the very start.
    If HtmlString.IndexOf("<img ") < 0 Then
        Return HtmlString
    Else
        Dim sb As StringBuilder
        sb.Initialize
        Dim OldPath As String = ""
        Dim MatchWholeImgTag As Matcher
        Dim MatchFilename As Matcher
        Dim FoundFilenameString As String = ""
        Dim ImageTagString As String = ""
        Dim PositionImgTagStart As Int = 0
        Dim PositionImgTagEnd As Int = 0
        Dim PositionImgTagEndBefore As Int = 0
        ' Keep the "{noimage}" marker as is; with a trailing slash the comparison further down would never match.
        If PathToInject <> "{noimage}" And Not(PathToInject.EndsWith("/")) Then
            PathToInject = PathToInject & "/"
        End If
        ' B4X Regex quick tester --> https://b4x.com:51041/regex_ws/index.html
        MatchWholeImgTag = Regex.Matcher("<img[^>]* src=[^>]*>", HtmlString) ' Find WHOLE IMAGE TAG: <img src="...">
        Do While MatchWholeImgTag.Find
            ImageTagString = MatchWholeImgTag.Match ' <img src="img1.png" width="96" height="96" >
            PositionImgTagStart = MatchWholeImgTag.GetStart(0)
            PositionImgTagEnd = MatchWholeImgTag.GetEnd(0)
            ' Copy the text between the previous tag and this one unchanged
            sb.Append(HtmlString.SubString2(PositionImgTagEndBefore, PositionImgTagStart))
            Dim RXOptions As Int = Regex.MULTILINE
            MatchFilename = Regex.Matcher2($"<img.*?src="([^"]+)".*?>"$, RXOptions, ImageTagString) ' Find the FILENAME in src --> https://regex101.com/r/eEyf5S/2
            If MatchFilename.Find Then
                FoundFilenameString = MatchFilename.Group(1)
                If PathToInject = "{noimage}" Then
                    ' Replace the whole tag with a placeholder character
                    ImageTagString = "⬜"
                Else
                    Dim p1 As Int = ImageTagString.IndexOf($"src="$)
                    ' The first quote after "src" is the opening quote of the attribute value
                    Dim p2 As Int = ImageTagString.IndexOf2($"""$, p1 + 2)
                    Dim LeftPart As String = ImageTagString.SubString2(0, p1)
                    Dim RightPart As String = ImageTagString.SubString(p2 + 1)
                    If FoundFilenameString.Contains("/") Then
                        OldPath = FoundFilenameString.SubString2(0, FoundFilenameString.LastIndexOf("/") + 1)
                        ' Note: Replace removes every occurrence of OldPath in the rest of the tag
                        ImageTagString = $"${LeftPart}src="${PathToInject}${RightPart.Replace(OldPath,"")}"$
                    Else
                        ImageTagString = $"${LeftPart}src="${PathToInject}${RightPart}"$
                    End If
                End If
            End If
            sb.Append(ImageTagString)
            PositionImgTagEndBefore = PositionImgTagEnd
        Loop
        ' Append whatever follows the last image tag
        sb.Append(HtmlString.SubString(PositionImgTagEndBefore))
        Return sb.ToString
    End If
End Sub
Note: The code is deliberately a bit verbose for the sake of traceability. Performance can certainly be improved by optimizing it.
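As one possible direction for such an optimization, here is a minimal sketch that does the replacement in a single pass. It assumes that Regex.Replace with $1/$2 group references is available in the B4X platform you are using. The Sub name InjectPathCompact and the pattern are my own illustration, not a tested drop-in replacement for the routine above.

```b4x
' Hypothetical compact variant; the Sub name is made up for illustration.
' Assumes Regex.Replace with $1/$2 group references is available.
Sub InjectPathCompact(HtmlString As String, PathToInject As String) As String
    If Not(PathToInject.EndsWith("/")) Then PathToInject = PathToInject & "/"
    ' Group 1: the tag up to and including src="
    ' (?:[^">]*/)? : the old directory part, if any, which is dropped
    ' Group 2: the bare file name
    Dim Pattern As String = $"(<img\b[^>]*?\ssrc=")(?:[^">]*/)?([^"/>]+)"$
    Return Regex.Replace(Pattern, HtmlString, "$1" & PathToInject & "$2")
End Sub
```

This sketch does not handle the {noimage} marker, and a PathToInject containing "$" or "\" would be interpreted by the replacement template, so such characters would need escaping first.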
Test:
B4X:
Sub Test1
    Dim TestLines As List
    TestLines.Initialize
    TestLines.Add($"yyy zzz"$)
    TestLines.Add($"aa <img src="image1.jpg" alt="1xyz"> bb"$)
    TestLines.Add($"cc <img src="/oooo/pppp/image2.jpg" alt="2xyz"> dd"$)
    TestLines.Add($"ee <img src="image3.jpg" alt="3xyz"> ff <img src="imgage4.jpg" alt="4 xxyyzz"> gg"$)
    TestLines.Add($"hh <img src="/zzz/image5.jpg" alt="5yyyy"> ii jj <img src="/yyy/image6.jpg" alt="6xyz"> kk"$)
    For Each x As String In TestLines
        Log("#-")
        Log("#- --------- --------- --------- ---------")
        Log("#- Input =" & x)
        Log("#- Result=" & InjectPathIntoHtmlImageTags(x, "/newpath1/subpathxy/"))
    Next
End Sub
Output:
#-
#- --------- --------- --------- ---------
#- Input =yyy zzz
#- Result=yyy zzz
#-
#- --------- --------- --------- ---------
#- Input =aa <img src="image1.jpg" alt="1xyz"> bb
#- Result=aa <img src="/newpath1/subpathxy/image1.jpg" alt="1xyz"> bb
#-
#- --------- --------- --------- ---------
#- Input =cc <img src="/oooo/pppp/image2.jpg" alt="2xyz"> dd
#- Result=cc <img src="/newpath1/subpathxy/image2.jpg" alt="2xyz"> dd
#-
#- --------- --------- --------- ---------
#- Input =ee <img src="image3.jpg" alt="3xyz"> ff <img src="imgage4.jpg" alt="4 xxyyzz"> gg
#- Result=ee <img src="/newpath1/subpathxy/image3.jpg" alt="3xyz"> ff <img src="/newpath1/subpathxy/imgage4.jpg" alt="4 xxyyzz"> gg
#-
#- --------- --------- --------- ---------
#- Input =hh <img src="/zzz/image5.jpg" alt="5yyyy"> ii jj <img src="/yyy/image6.jpg" alt="6xyz"> kk
#- Result=hh <img src="/newpath1/subpathxy/image5.jpg" alt="5yyyy"> ii jj <img src="/newpath1/subpathxy/image6.jpg" alt="6xyz"> kk