Problem with XML-parsing and Turkish characters

moster67

Expert
Licensed User
Longtime User
I need some help with a problem I have and hopefully it can be resolved.

I am aware of the problem with Turkish characters (see
Internationalizing Turkish: Dotted and Dotless Turkish Letter "I").

My problem is probably related to what is described in the article above.

I am parsing an XML-file and at a certain point, I get an error as follows:

org.apache.harmony.xml.ExpatParser$ParseException: At line 273, column 102: not well-formed (invalid token)

The full error message is as follows:

org.apache.harmony.xml.ExpatParser$ParseException: At line 273, column 102: not well-formed (invalid token)
at org.apache.harmony.xml.ExpatParser.parseFragment(ExpatParser.java:515)
at org.apache.harmony.xml.ExpatParser.parseDocument(ExpatParser.java:474)
at org.apache.harmony.xml.ExpatReader.parse(ExpatReader.java:321)
at org.apache.harmony.xml.ExpatReader.parse(ExpatReader.java:279)
at anywheresoftware.b4a.objects.SaxParser.parse(SaxParser.java:78)
at anywheresoftware.b4a.objects.SaxParser.Parse(SaxParser.java:71)
at anywheresoftware.b4a.samples.xmlsax.main._activity_create(main.java:244)
at java.lang.reflect.Method.invokeNative(Native Method)
at java.lang.reflect.Method.invoke(Method.java:511)
at anywheresoftware.b4a.BA.raiseEvent2(BA.java:167)
at anywheresoftware.b4a.samples.xmlsax.main.afterFirstLayout(main.java:89)
at anywheresoftware.b4a.samples.xmlsax.main.access$100(main.java:16)
at anywheresoftware.b4a.samples.xmlsax.main$WaitForLayout.run(main.java:74)
at android.os.Handler.handleCallback(Handler.java:615)
at android.os.Handler.dispatchMessage(Handler.java:92)
at android.os.Looper.loop(Looper.java:137)
at android.app.ActivityThread.main(ActivityThread.java:4898)
at java.lang.reflect.Method.invokeNative(Native Method)
at java.lang.reflect.Method.invoke(Method.java:511)
at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:1006)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:773)
at dalvik.system.NativeStart.main(Native Method)
org.apache.harmony.xml.ExpatParser$ParseException: At line 273, column 102: not well-formed (invalid token)

Unfortunately, I cannot know in advance which characters the XML file will contain. My app is used all over the world, but this is the first time I have received a report about this issue, and it came from Turkish users. I think this error could also happen with certain characters of the Lithuanian and Armenian alphabets.

In the real app I read this file using httputils, since it comes from a network device through a specific API. However, to produce an example that demonstrates the problem (see attachment), I wrote a test program that reads the XML file locally, and the error is the same.

Is it possible to resolve this issue?

I haven't tried it, but perhaps in the parser sub I could use Try and Catch and, on error, continue with the next record. That would not actually resolve the problem, though. I would prefer to get it working the way it should, out of respect for my Turkish users.

Hope someone can help me to resolve this problem so I can read these characters properly. Thanks.
 

Attachments

  • xmlTurkish.zip
    25.3 KB · Views: 265

NJDude

Expert
Licensed User
Longtime User
I could be wrong, but it seems you have an illegal character in line #273 in the XML.

The encoding for Turkish is ISO-8859-9, but even using that it fails on that line.

Look at the attached code; you'll see the modification I made, and it still doesn't work.

Also, look at THIS screenshot.
 

Attachments

  • Encoding.zip
    25.3 KB · Views: 326

moster67

Expert
Licensed User
Longtime User
Thank you NJ,

I tried what you did as well but it did not work.

The XML-file is always in the format you see - only the content changes and in this Turkish case, it can include illegal characters.

I think it might be related to the following Java code found in the source folder (although I am not sure):

public static String _parser_endelement(String _uri,String _name,anywheresoftware.b4a.keywords.StringBuilderWrapper _text) throws Exception{
    //BA.debugLineNum = 35;BA.debugLine="Sub Parser_EndElement (Uri As String, Name As String, Text As StringBuilder)";
    //BA.debugLineNum = 37;BA.debugLine="If parser.Parents.IndexOf(\"e2event\") > - 1 Then";
    if (_parser.Parents.IndexOf((Object)("e2event"))>-1) {
        //BA.debugLineNum = 39;BA.debugLine="If Name = \"e2eventid\" Then";
        if ((_name).equals("e2eventid")) {
            //BA.debugLineNum = 40;BA.debugLine="Log(\"e2eventid - \" & Text.ToString)";
            anywheresoftware.b4a.keywords.Common.Log("e2eventid - "+_text.ToString());
        }else if((_name).equals("e2eventtitle")) {
            //BA.debugLineNum = 42;BA.debugLine="Log(\"e2eventtitle - \" & Text.ToString)'";
            anywheresoftware.b4a.keywords.Common.Log("e2eventtitle - "+_text.ToString());
        }else if((_name).equals("e2eventdescription")) {
            //BA.debugLineNum = 44;BA.debugLine="Log(\"e2eventdescription - \" & Text.ToString)'";
            anywheresoftware.b4a.keywords.Common.Log("e2eventdescription - "+_text.ToString());
        }else if((_name).equals("e2eventdescriptionextended")) {
            //BA.debugLineNum = 46;BA.debugLine="Log(\"e2eventdescriptionextended - \" & Text.ToString)'";
            anywheresoftware.b4a.keywords.Common.Log("e2eventdescriptionextended - "+_text.ToString());
        };
    };

and in particular to the use of (_name).equals.

There must be a way to resolve this error.

Any other ideas?
 

Erel

B4X founder
Staff member
Licensed User
Longtime User
It is not related to the code you posted, and it is not related to the dotted I (believe me, I have been familiar with this issue since basic4ppc v1.0 in 2005 ;) ).

The problem is that the file is not valid XML. One solution is to catch this error. Another solution is to create an InputFilter in a library that will filter out the invalid characters.
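
One way to picture the InputFilter idea is a java.io.FilterReader that drops characters outside the XML 1.0 Char range before they reach the parser. This is only a minimal sketch under assumptions, not code from any B4A library: the class name XmlCharFilterReader and the helper filter are hypothetical, and surrogate code units are passed through without checking that they are correctly paired. It also cannot repair other well-formedness errors, such as a bare & or bytes that do not match the declared encoding.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical reader that silently drops characters XML 1.0 does not allow. */
final class XmlCharFilterReader extends FilterReader {

    XmlCharFilterReader(Reader in) {
        super(in);
    }

    /** XML 1.0 Char range for one UTF-16 unit; surrogates pass through unchecked. */
    static boolean isAllowed(char c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD)
                || Character.isSurrogate(c);
    }

    @Override
    public int read() throws IOException {
        int c;
        while ((c = super.read()) != -1) {
            if (isAllowed((char) c)) {
                return c;
            }
        }
        return -1;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        while (true) {
            int n = super.read(buf, off, len);
            if (n <= 0) {
                return n; // end of stream (-1) or nothing read
            }
            int kept = 0;
            for (int i = 0; i < n; i++) {
                char c = buf[off + i];
                if (isAllowed(c)) {
                    buf[off + kept++] = c; // compact allowed chars in place
                }
            }
            if (kept > 0) {
                return kept;
            }
            // Every char in this chunk was dropped: read the next chunk.
        }
    }

    /** Convenience helper for illustration: filters a whole string. */
    static String filter(String s) {
        StringBuilder out = new StringBuilder();
        char[] buf = new char[1024];
        try (Reader r = new XmlCharFilterReader(new StringReader(s))) {
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                out.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

In a plain Java SAX setup, such a reader could be wrapped in an org.xml.sax.InputSource and handed to the parser. Turkish letters such as İ and ı are ordinary valid characters and pass through untouched.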
 

moster67

Expert
Licensed User
Longtime User
Sorry - I wrote my reply while you posted your first reply so I didn't see what you said.

Did you mean that it is not possible to use Try-Catch in the "Parser_EndElement sub", so that I cannot "skip" the offending record and continue reading the next one?
If so, is there no way I can read those records?

Can you elaborate a little bit further about the InputFilter-library? Is there any built-in function in Java which I could use in such a library to check for invalid characters?
 

moster67

Expert
Licensed User
Longtime User
Resolved

OK, I managed to find a good solution for handling illegal characters in the XML file by using Try and Catch and two functions I found here in the forum.

Basically, if there is a problem with the XML file, it is handled in Catch:
1) I read the XML file again, assigning it to a string variable
2) I pass the new string to a function which takes care of deleting the undesired characters (see http://www.b4x.com/forum/basic4android-updates-questions/18219-how-strip-invalid-xml-characters.html#post105025)
3) then I convert the string back into an InputStream (see http://www.b4x.com/forum/basic4android-updates-questions/25820-string-inputstream-conversion.html#post149490)
4) now I parse the XML file again and I am good :)
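
The four steps above can be sketched in plain Java roughly as follows. This is not the code from the attachment or from the linked forum functions: LenientXmlParse, TextCollector and stripInvalidXmlChars are hypothetical names, and the sketch assumes the file is UTF-8 encoded. A fresh handler is created for the second attempt so that partial results from the failed first pass are not counted twice.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.function.Supplier;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler for illustration: collects all character data. */
class TextCollector extends DefaultHandler {
    final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }
}

final class LenientXmlParse {
    private LenientXmlParse() {}

    /** Parses normally; on a parse error, strips invalid chars and retries once. */
    static <H extends DefaultHandler> H parseWithFallback(byte[] xmlBytes, Supplier<H> newHandler) {
        try {
            return parseOnce(xmlBytes, newHandler.get());
        } catch (SAXException firstFailure) {
            // Step 1: read the data again as a string (assumption: UTF-8).
            String text = new String(xmlBytes, StandardCharsets.UTF_8);
            // Step 2: delete the characters XML 1.0 does not allow.
            String cleaned = stripInvalidXmlChars(text);
            // Step 3: turn the string back into bytes for an InputStream.
            byte[] cleanedBytes = cleaned.getBytes(StandardCharsets.UTF_8);
            try {
                // Step 4: parse again, with a fresh handler.
                return parseOnce(cleanedBytes, newHandler.get());
            } catch (SAXException secondFailure) {
                throw new IllegalArgumentException(
                        "XML is still not well-formed after cleaning", secondFailure);
            }
        }
    }

    private static <H extends DefaultHandler> H parseOnce(byte[] bytes, H handler)
            throws SAXException {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(bytes), handler);
            return handler;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No usable SAX parser", e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Keeps only code points in the XML 1.0 Char range. */
    static String stripInvalidXmlChars(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            boolean ok = cp == 0x9 || cp == 0xA || cp == 0xD
                    || (cp >= 0x20 && cp <= 0xD7FF)
                    || (cp >= 0xE000 && cp <= 0xFFFD)
                    || (cp >= 0x10000 && cp <= 0x10FFFF);
            if (ok) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Note that if the bytes are not really UTF-8, decoding replaces the undecodable bytes with U+FFFD. That character is valid XML, so the second parse may succeed while the affected letters are lost; in that case the encoding, not the filter, is what needs fixing.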

Thanks to NJ and Erel for helping me out. Many thanks also to B4A user naruto for his useful function.

I am attaching a new example that demonstrates the procedure above, which resolves my problem. Perhaps it will be useful for someone else.
 

Attachments

  • XML_handleIllegalChar.zip
    26 KB · Views: 302