RegEx and 'replace'

peacemaker

Hi, all

The task is to strip the HTML markup from a string and produce plain text.

Here is JavaScript code that does this:

B4X:
str = '**ANY HTML CONTENT HERE**';

str=str.replace(/<\s*br\/*>/gi, "\n");
str=str.replace(/<\s*a.*href="(.*?)".*>(.*?)<\/a>/gi, " $2 (Link->$1) ");
str=str.replace(/<\s*\/*.+?>/ig, "\n");
str=str.replace(/ {2,}/gi, " ");
str=str.replace(/\n+\s*/gi, "\n\n");

How can I do this kind of "replace" using RegEx?

Other code variants are welcome too.
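
For reference, the five replacements from the question can be wrapped in a function and run on a small sample to see what output a port should produce. This is only a sketch: the function name `htmlToPlainText` and the sample markup are assumptions made for illustration, while the patterns are copied unchanged from the question.

```javascript
// Illustrative wrapper (the name htmlToPlainText is an assumption);
// the five patterns are copied unchanged from the question.
function htmlToPlainText(str) {
  // <br> and <br/> become a newline
  str = str.replace(/<\s*br\/*>/gi, "\n");
  // links keep their text and URL
  str = str.replace(/<\s*a.*href="(.*?)".*>(.*?)<\/a>/gi, " $2 (Link->$1) ");
  // every remaining tag becomes a newline
  str = str.replace(/<\s*\/*.+?>/ig, "\n");
  // runs of spaces collapse to one space
  str = str.replace(/ {2,}/gi, " ");
  // each run of newlines plus following whitespace becomes one blank line
  str = str.replace(/\n+\s*/gi, "\n\n");
  return str;
}

const sample = '<p>Hello<br/>world <a href="http://example.com">site</a></p>';
console.log(JSON.stringify(htmlToPlainText(sample)));
// prints "\n\nHello\n\nworld site (Link->http://example.com) \n\n"
```

Note that the `.*` parts of the link pattern are greedy, so two links on the same line are matched as a single link and only the last URL and text survive.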
 

Erel

B4X founder
Staff member
You can do it with the help of the Reflection library:
B4X:
Sub Activity_Create(FirstTime As Boolean)
   Log(RegexReplace("abc(d)(e)", "abcde", "$2 $1"))
End Sub

Sub RegexReplace(Pattern As String, Text As String, Replacement As String) As String
   Dim m As Matcher
   m = Regex.Matcher(Pattern, Text)
   Dim r As Reflector
   r.Target = m
   Return r.RunMethod2("replaceAll", Replacement, "java.lang.String")
End Sub
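
The Reflection call above runs Java's `replaceAll` on the matcher, where `$1` and `$2` in the replacement string refer to the captured groups. JavaScript's `replace` uses the same convention, so the test in `Activity_Create` can be cross-checked with a JavaScript one-liner. This is a rough equivalent for comparison, not part of the B4X solution.

```javascript
// Same pattern, input and replacement as the B4X test above.
const swapped = "abcde".replace(/abc(d)(e)/g, "$2 $1");
console.log(swapped); // prints "e d"
```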
 

peacemaker

B4X:
Sub RegexReplace(Pattern As String, Text As String, Replacement As String) As String
    Dim m As Matcher
    m = Regex.Matcher(Pattern, Text)
    Dim r As Reflector
    r.Target = m
    Return r.RunMethod2("replaceAll", Replacement, "java.lang.String")
End Sub

Sub PlainText (HTML As String) As String
    ' Java regex patterns take no /.../ delimiters and no trailing flags; (?i) makes a pattern case-insensitive.
    HTML = RegexReplace("(?i)<\s*br\/*>", HTML, CRLF)
    HTML = RegexReplace("(?i)<\s*a.*href=" & QUOTE & "(.*?)" & QUOTE & ".*>(.*?)<\/a>", HTML, " $2 (Link->$1) ")
    HTML = RegexReplace("(?i)<\s*\/*.+?>", HTML, CRLF)
    HTML = RegexReplace(" {2,}", HTML, " ")
    HTML = RegexReplace("\n+\s*", HTML, CRLF & CRLF)
Return HTML
End Sub
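
When porting the JavaScript literals from the question to the pattern strings that `Regex.Matcher` takes, the surrounding slashes and the trailing `gi` must not be carried over. A Java-style pattern such as `/<\s*br\/*>/gi` would only match text that literally starts with `/` and ends with `/gi`. The `g` flag is not needed because `replaceAll` already replaces every match, and `i` becomes the inline flag `(?i)`. The conversion can be sketched as follows; the helper name `toInlineFlagPattern` is hypothetical and only illustrates the mapping.

```javascript
// Hypothetical helper: turns the text of a JavaScript regex literal such as
// "/<\s*br\/*>/gi" into a delimiter-free pattern string with inline flags.
function toInlineFlagPattern(literal) {
  const lastSlash = literal.lastIndexOf("/");
  if (literal[0] !== "/" || lastSlash === 0) {
    throw new Error("not a /pattern/flags literal: " + literal);
  }
  const body = literal.slice(1, lastSlash);   // text between the delimiters
  const flags = literal.slice(lastSlash + 1); // e.g. "gi"
  // Keep only flags that have an inline form (i, m, s); "g" is dropped.
  let inline = "";
  for (const f of flags) {
    if (f === "i" || f === "m" || f === "s") inline += f;
  }
  return (inline ? "(?" + inline + ")" : "") + body;
}

console.log(toInlineFlagPattern("/<\\s*br\\/*>/gi")); // prints (?i)<\s*br\/*>
console.log(toInlineFlagPattern("/ {2,}/gi"));        // prints (?i) {2,}
```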
 

NeoTechni

I use:

B4X:
Sub StripHTML(Text As String) As String
   Dim temp As Int ,temp2 As Int 
   Do While temp>-1 'remove GMAIL quote   <div class=3D"gmail_quote"> to <br>
      temp = Instr(Text, "<div class=3D" & vbQuote & "gmail_quote" & vbQuote & ">", 0)
      If temp =-1 Then temp = Instr(Text, "<div class=" & vbQuote & "gmail_quote" & vbQuote & ">",0)
      If temp >-1 Then'is gmail
         temp2 = Instr(Text, "</blockquote></div>", temp)
         If temp2=-1 Then
            temp=-1
         Else
            Text = Left(Text, temp) & Right(Text, Text.Length - (temp2+19))
         End If
      End If
   Loop
   Text = Text.Replace("<br>", CRLF).Replace("<p>", CRLF)
   temp=0
   Do While temp>-1'remove all HTML
      temp = Instr(Text, "<", 0)
      If temp>-1 Then
         temp2=Instr(Text,">", temp)
         If temp2=-1 Then
            temp=-1
         Else
            Text = Left(Text, temp) & Right(Text, Text.Length - (temp2+1))
         End If
      End If
   Loop   
   Do While Instr(Text, CRLF & CRLF,0)>-1 'remove double new lines
      Text=Text.Replace(CRLF & CRLF, CRLF)
   Loop
   Return Text.Trim'.Replace(STimer.ReplyWarning, "").Trim
End Sub

Sub Left(Text As String, Length As Long)As String 
   If Text.Length>0 AND Length>0 Then
      'If Length>Text.Length Then Length=Text.Length 
      Return Text.SubString2(0, Min(Text.Length,Length))
   End If
   Return ""
End Sub

Sub Right(Text As String, Length As Long) As String
   If Text.Length>0 AND Length>0 Then
      'If Length>Text.Length Then Length=Text.Length 
      Return Text.SubString(Text.Length-Min(Text.Length,Length))
   End If
   Return ""
End Sub
Sub Mid(Text As String, Start As Int, Length As Int) As String 
   If Length>0 AND Start>-1 AND Start< Text.Length Then Return Text.SubString2(Start,Start+Length)
End Sub
Sub Instr(Text As String, TextToFind As String, Start As Int) As Int
   Return Text.IndexOf2(TextToFind,Start)
End Sub
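
The second loop in `StripHTML` is a plain scan: find the next `<`, find the first `>` after it, cut out everything in between, and repeat until no complete tag is left. For comparison with the JavaScript in the question, that loop alone can be sketched as below. The function name `stripTagsByScan` is illustrative, and the Gmail-quote removal and newline collapsing are left out.

```javascript
// Sketch of the "find < ... > and cut it out" loop only.
function stripTagsByScan(text) {
  let open = text.indexOf("<");
  while (open !== -1) {
    const close = text.indexOf(">", open);
    if (close === -1) break; // unterminated "<": leave the rest untouched
    // Keep what precedes "<" and what follows ">".
    text = text.slice(0, open) + text.slice(close + 1);
    // Nothing before `open` contains "<", so the search can resume there.
    open = text.indexOf("<", open);
  }
  return text;
}

console.log(stripTagsByScan("a<b>bold</b> c")); // prints "abold c"
```

Because no regular expression is involved, a stray `<` without a closing `>` simply stops the loop instead of swallowing the rest of the text.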
 

peacemaker

Thanks!
Very good result! Execution speed is not that important here.

I've also added this to your code:
a = a.Replace("&nbsp;", " ")

to be finally happy ;)

Note: vbQuote in the code above corresponds to the built-in QUOTE constant.
 

sorex

I had to do something similar, but I don't like adding libraries when they're not really needed.

So I ended up with this simple subroutine that gets rid of ALL the HTML tags in the CDATA XML fields.

Log(striptags(htmldata))


B4X:
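Sub striptags(t As String) As String
   Dim subs() As String
   Dim x As Int
   ' Every piece ends just before a ">", so it holds at most one unfinished tag.
   subs = Regex.Split(">", t)
   t = ""
   For x = 0 To subs.Length - 1
      If subs(x).IndexOf("<") > -1 Then
         ' Keep only the text in front of the tag.
         t = t & subs(x).SubString2(0, subs(x).IndexOf("<"))
      Else
         t = t & subs(x)
      End If
   Next
   Return t
End Sub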
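
The idea in `striptags` is to split the text on `>` so that each piece contains at most one unfinished tag, then keep only what comes before the `<` in each piece. The same technique in JavaScript, as a minimal sketch with the illustrative name `stripTagsBySplit`:

```javascript
// Split on ">" and keep only the text in front of each "<".
function stripTagsBySplit(html) {
  return html
    .split(">")
    .map(piece => {
      const lt = piece.indexOf("<");
      return lt === -1 ? piece : piece.slice(0, lt);
    })
    .join("");
}

console.log(stripTagsBySplit("<b>Hi</b> there")); // prints "Hi there"
```

One consequence of splitting on `>` is that a literal `>` in the text itself is dropped as well: `"a > b"` comes back as `"a  b"`.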
 