How to strip invalid XML characters

naruto

Member
Licensed User
Longtime User
My SQLite database contains invalid characters that I need to remove before I can use XML-Builder to generate clean XML and parse it again with SAX.

According to the pages I have read, the valid characters are the following Unicode code points:
0x9 | 0xA | 0xD | [0x20-0xD7FF] | [0xE000-0xFFFD] | [0x10000-0x10FFFF]
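
The ranges above describe Unicode code points, so the check has to run on decoded characters rather than on raw UTF-8 bytes. A minimal sketch of the test in Java (the language of the linked Stack Overflow example); the names `XmlChars` and `isXml10Char` are invented here for illustration and belong to no library:

```java
// Hypothetical helper: tests one Unicode code point against the ranges quoted above.
class XmlChars {
    static boolean isXml10Char(int codePoint) {
        return codePoint == 0x9
            || codePoint == 0xA
            || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    public static void main(String[] args) {
        System.out.println(isXml10Char(0x4E2D)); // a Chinese character, prints true
        System.out.println(isXml10Char(0x0B));   // vertical tab, prints false
    }
}
```

Taking an `int` rather than a 16-bit char lets the last range (0x10000-0x10FFFF) be reached at all.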

I found an example (see the link below) and tried to rewrite it in Basic4android, but my Chinese characters are stripped and only the plain ASCII characters are left. It would be nice if it accepted the whole valid range instead of just a-z.

java - How to encode characters from Oracle to XML? - Stack Overflow

By the way, XML-Builder does not handle these characters well, which is why I need a function that strips the bad ones.

My code:
B4X:
Sub XmlCharacterWhitelist(in_string As String) As String
    If( in_string == Null ) Then
        Return Null
    End If

    Dim data() As Byte : data = in_string.GetBytes("UTF8")

    Dim sbOutput As StringBuilder : sbOutput.Initialize
    Dim ch As Byte

    For i = 0 To data.Length - 1
        ch = data(i)

        If ((ch >= 0x0020 AND ch <= 0xD7FF ) OR _
            (ch >= 0xE000 AND ch <= 0xFFFD ) OR _
            (ch >= 0x10000 AND ch <= 0x10FFFF) OR _
            ch == 0x0009 OR _
            ch == 0x000A OR _
            ch == 0x000D ) Then

            sbOutput.Append(BytesToString(data, i, 1, "UTF8"))
        End If
    Next

    Return sbOutput.ToString
End Sub
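
A likely reason the routine above drops Chinese text: `GetBytes("UTF8")` produces several bytes per Chinese character, and a Byte on this platform is signed, so every one of those bytes is negative and fails `ch >= 0x0020`. Plain ASCII characters are single positive bytes, so they pass. The Java sketch below shows the byte values; `Utf8ByteDemo` and `allBytesNegative` are names made up for this illustration:

```java
import java.nio.charset.StandardCharsets;

class Utf8ByteDemo {
    // Returns true when every UTF-8 byte of s is negative as a signed Java byte.
    static boolean allBytesNegative(String s) {
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (b >= 0) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String zhong = "\u4E2D"; // the Chinese character U+4E2D
        for (byte b : zhong.getBytes(StandardCharsets.UTF_8)) {
            System.out.println(b + " (0x" + Integer.toHexString(b & 0xFF) + ")");
        }
        // The three bytes are 0xE4 0xB8 0xAD, i.e. -28, -72 and -83 as signed bytes,
        // so none of them passes a test such as b >= 0x20.
    }
}
```

Comparing character codes taken from the string itself, instead of its encoded bytes, avoids the problem.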

Any help is appreciated as I have been fighting with this for some days now.

Looking forward to any replies.
 

naruto

I have attached the requested UTF-8 text file with the bad characters. The big question now is how to escape, or at least remove, the bad characters while leaving the Chinese characters between them intact; that is, as in my post above, to allow only the range of valid characters.

Thanks for your time.
 

Attachments

  • badtext-utf8.txt
    115 bytes · Views: 261

naruto

Thanks for the suggestions, but I wanted to keep a readable text file if possible.

I finally got the strip_invalid_xml_characters function working (it is named XmlCharacterWhitelist in the code). For those who are interested, please see the function below. Right now it simply skips each invalid character; depending on your needs, you could replace the character with something else instead.

B4X:
' Removes characters that may not appear literally in well-formed XML documents.
' See W3C - Extensible Markup Language (XML) 1.1 (Second Edition) for the valid character ranges:
' http://www.w3.org/TR/2006/REC-xml11-20060816/#charsets
Sub XmlCharacterWhitelist(in_string As String) As String
    If (in_string == Null OR in_string == "") Then
        Return in_string ' nothing to strip
    End If

    Dim sbOutput As StringBuilder : sbOutput.Initialize ' string builder that collects the valid characters
    Dim ch As Char ' A Char is 16 bits in Java, so it holds one UTF-16 code unit
    Dim i As Int : i = 0

    Do While i < in_string.Length
        ch = in_string.CharAt(i)
        Dim ch_number As Int : ch_number = Asc(ch)

        If (ch_number >= 0xD800 AND ch_number <= 0xDBFF) Then
            ' High surrogate: code points 0x10000-0x10FFFF arrive as two Chars.
            ' Keep the pair only if a low surrogate follows; a lone surrogate is dropped.
            If (i + 1 < in_string.Length) Then
                Dim low_number As Int : low_number = Asc(in_string.CharAt(i + 1))
                If (low_number >= 0xDC00 AND low_number <= 0xDFFF) Then
                    sbOutput.Append(ch)
                    sbOutput.Append(in_string.CharAt(i + 1))
                    i = i + 1 ' skip the low surrogate that was just copied
                End If
            End If
        Else If ((ch_number >= 0x0001 AND ch_number <= 0xD7FF) OR _
                 (ch_number >= 0xE000 AND ch_number <= 0xFFFD)) Then
            ' Any Unicode character, excluding the surrogate blocks, FFFE, and FFFF.
            ' Restricted chars (control characters other than tab, LF, CR and NEL) are skipped as well.
            If ((ch_number < 0x1  OR ch_number > 0x8)  AND _
                (ch_number < 0xB  OR ch_number > 0xC)  AND _
                (ch_number < 0xE  OR ch_number > 0x1F) AND _
                (ch_number < 0x7F OR ch_number > 0x84) AND _
                (ch_number < 0x86 OR ch_number > 0x9F)) Then
                sbOutput.Append(ch)
            End If
        End If
        i = i + 1
    Loop

    Return sbOutput.ToString
End Sub
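
For anyone who would rather replace invalid characters than skip them, here is a rough Java sketch of the same filter that walks the string by code point, so characters above 0xFFFF are tested as a whole. The names `XmlSanitizer`, `isAllowed` and `sanitize`, and the idea of passing the replacement as a parameter, are assumptions made for this example only:

```java
// Hypothetical Java counterpart of the function above: substitutes a
// replacement string for every character that is not allowed.
class XmlSanitizer {
    // XML 1.1 Char ranges minus the RestrictedChar ranges.
    static boolean isAllowed(int cp) {
        boolean isChar = (cp >= 0x1 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
        boolean isRestricted = (cp >= 0x1 && cp <= 0x8)
                || (cp >= 0xB && cp <= 0xC)
                || (cp >= 0xE && cp <= 0x1F)
                || (cp >= 0x7F && cp <= 0x84)
                || (cp >= 0x86 && cp <= 0x9F);
        return isChar && !isRestricted;
    }

    static String sanitize(String in, String replacement) {
        if (in == null) return null;
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);   // joins a valid surrogate pair into one code point
            if (isAllowed(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append(replacement);  // pass "" to skip instead of replace
            }
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+0001 and U+000B are replaced; the Chinese character U+4E2D is kept.
        System.out.println(sanitize("a\u0001\u4E2D\u000Bb", "?"));
    }
}
```

A lone surrogate comes back from `codePointAt` as its own value in the 0xD800-0xDFFF block, so it is treated as invalid and replaced.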
 