B4J Question xmlsax cannot parse correctly when elements and text are mixed

xulihang

Active Member
Licensed User
Longtime User
I am trying to parse XLIFF, an XML format for localization.

It contains elements with mixed content (text and child elements together), such as: <g id="1">Styles like <g id="2">bold</g> is supported.</g>

I use XmlSax to parse it, but I cannot get the first text part, "Styles like ", in the _EndElement event.

The result using Xml2Map:
B4X:
(MyMap) {g={Attributes={id=1}, g={Attributes={id=2}, Text=bold}}}

I tried parsing it with jsoup, which gives the right result:

B4X:
(TextNode)
Styles like
(Element) <g id="2">
bold
</g>
(TextNode)
bold
(TextNode)  is supported.

Is XmlSax wrapped correctly? I would prefer to use Xml2Map, which parses XML into a map.
 

xulihang

I checked the wrapper's source code. I think it should raise a characters event so that each piece of text can be handled:


Java:
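        // Called by the SAX parser for each run of character data
        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException
        {
            String str = new String(ch, start, length);
            // Keep accumulating the text in the existing buffer
            sb.append(str);
            // Proposed: also raise an event for this run of text
            ba.raiseEvent2(null, true, charactersEvent, false, str);
        }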
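
For reference, here is a minimal standalone sketch in plain Java (using the JDK's SAX parser, not the B4X wrapper) of why a characters callback matters for mixed content. The class name MixedContentSaxDemo and the start:/text:/end: labels are made up for illustration. The handler buffers text and flushes it whenever an element starts or ends, so every text run is reported in document order.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of jXmlSax
class MixedContentSaxDemo {

    /** Returns the SAX events for the given XML, in document order. */
    static List<String> events(String xml) throws Exception {
        List<String> out = new ArrayList<>();
        StringBuilder text = new StringBuilder();

        DefaultHandler handler = new DefaultHandler() {
            // Emit any buffered text before the next element boundary
            private void flushText() {
                if (text.length() > 0) {
                    out.add("text:" + text);
                    text.setLength(0);
                }
            }

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                flushText();
                out.add("start:" + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                flushText();
                out.add("end:" + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // The parser may deliver one text run in several chunks
                text.append(ch, start, length);
            }
        };

        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return out;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<g id=\"1\">Styles like <g id=\"2\">bold</g> is supported.</g>";
        for (String e : events(xml)) {
            System.out.println(e);
        }
    }
}
```

With the example element, "Styles like " is reported before the inner g starts and " is supported." after it ends, which is exactly the information that is lost when text is only delivered once in the end-element event.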
 

xulihang

So how should this situation be handled?

I tried adding a Name key to store the element name and a children key to store the parts in order. However, the resulting map cannot be converted back to XML with Map2Xml.

Map:

B4X:
(MyMap) {g={Attributes={id=1}, Name=g, children=[Styles like , {Attributes={id=2}, Name=g, children=bold},  is supported.]}}
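
One way to keep the order of text and child elements, and still be able to write the XML back, is a small tree type instead of a map. This is only a sketch in plain Java: XmlNode, attr, add and toXml are hypothetical names and are not part of Xml2Map or Map2Xml.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical node type: children keep their document order
class XmlNode {
    final String name;
    final Map<String, String> attributes = new LinkedHashMap<>();
    final List<Object> children = new ArrayList<>(); // each item is a String or an XmlNode

    XmlNode(String name) {
        this.name = name;
    }

    XmlNode attr(String key, String value) {
        attributes.put(key, value);
        return this;
    }

    XmlNode add(Object child) {
        children.add(child);
        return this;
    }

    // Escape the characters that are not allowed as-is in text or attribute values
    static String escape(String s, boolean inAttribute) {
        String r = s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
        return inAttribute ? r.replace("\"", "&quot;") : r;
    }

    // Serialize the node and its children back to XML, in order
    String toXml() {
        StringBuilder sb = new StringBuilder("<").append(name);
        for (Map.Entry<String, String> a : attributes.entrySet()) {
            sb.append(' ').append(a.getKey()).append("=\"")
              .append(escape(a.getValue(), true)).append('"');
        }
        if (children.isEmpty()) {
            return sb.append("/>").toString();
        }
        sb.append('>');
        for (Object c : children) {
            if (c instanceof XmlNode) {
                sb.append(((XmlNode) c).toXml());
            } else {
                sb.append(escape(c.toString(), false));
            }
        }
        return sb.append("</").append(name).append('>').toString();
    }

    public static void main(String[] args) {
        XmlNode inner = new XmlNode("g").attr("id", "2").add("bold");
        XmlNode outer = new XmlNode("g").attr("id", "1")
                .add("Styles like ").add(inner).add(" is supported.");
        System.out.println(outer.toXml());
    }
}
```

Because the children are a list rather than map keys, two g elements or several text parts under the same parent do not overwrite each other.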
 

xulihang

I have attached a trimmed-down version of the full XML, which was generated by Okapi Tikal from a docx file.
 

Attachments

  • 321880_ms.docx.xlf.zip
    588 bytes · Views: 169

Erel

B4X founder
Staff member
Licensed User
Longtime User
1. Such XML cannot be represented with a map so it will never work with Xml2Map.

2. You can parse it with MiniHtmlParser. I don't know whether it will be simpler than jSoup.

B4X:
Dim parser As MiniHtmlParser
parser.Initialize
Dim root As HtmlNode = parser.Parse(File.ReadString("C:\Users\H\Downloads\321880_ms.docx.xlf", ""))
parser.PrintNode(root)
 

xulihang

MiniHtmlParser is a better solution. I will try to use it in my scenario.

An XML writer is still needed to convert a modified XML node back to a string.
 

xulihang

I tried MiniHtmlParser and it cannot parse <source xml:lang="en"><g id="1">[Global Notes:</g></source> correctly.

I made a new XML parser based on a new jXmlSax.jar that raises a characters event.

I also wrapped javax.xml.parsers.DocumentBuilder to build xml.
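
As a rough idea of what the DocumentBuilder side looks like in plain Java (the attached wrapper may differ; DomRoundTripDemo and its method names are made up here): the XML is parsed into a DOM, which keeps mixed content and nesting, and a Transformer turns the possibly modified DOM back into a string.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not the attached wrapper
class DomRoundTripDemo {

    // Parse an XML string into a DOM document
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Serialize a DOM document back to a string, without the XML declaration
    static String toXml(Document doc) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter w = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(w));
        return w.toString();
    }

    // Name of the parent of the first element with the given tag name
    static String parentOf(String xml, String tag) throws Exception {
        return parse(xml).getElementsByTagName(tag).item(0)
                .getParentNode().getNodeName();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<source xml:lang=\"en\"><g id=\"1\">[Global Notes:</g></source>";
        System.out.println(parentOf(xml, "g"));
        System.out.println(toXml(parse(xml)));
    }
}
```

Since this is an XML parser and not an HTML parser, source is treated like any other element, so g stays nested inside it.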

The files are attached.
 

Attachments

  • jXmlSax.zip
    5.9 KB · Views: 148
  • source.zip
    3.1 KB · Views: 156

Erel

B4X founder
Staff member
Licensed User
Longtime User
"I tried MiniHtmlParser and it cannot parse <source xml:lang="en"><g id="1">[Global Notes:</g></source> correctly."
I'm not sure that this is correct.

Tested with this:
B4X:
Dim root As HtmlNode = parser.Parse($"<source xml:lang="en"><g id="1">[Global Notes:</g></source>"$)
parser.PrintNode(root)
The output looks correct.
 

xulihang

B4X:
*** root ***
 *** source ***
 |lang: en|
 *** g ***
 |id: 1|
  *** text ***
  |value: [Global Notes:|

The result I got is like this.
 

xulihang

The source element should be the parent of the g element.

My XML builder produces this from the parsed node: <root><source lang="en"/><g id="1">[Global Notes:</g></root>
 

Erel

B4X founder
Staff member
Licensed User
Longtime User
I see. This is indeed not a good use case for MiniHtmlParser, because the text is not valid HTML.

In HTML, the source element must be empty: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/source

This code sets the elements that will be closed automatically:
B4X:
VoidTags = B4XCollections.CreateSet2(Array("!DOCTYPE", "area", "base", "br", "col", _
        "embed", "hr", "img", "input", "link", "meta", "param", "source", "track", "wbr"))
 