B4J Question xmlsax cannot parse correctly when elements and text are mixed

xulihang · Aug 12, 2020

I am trying to parse XLIFF, an XML format for localization.

It contains elements like: <g id="1">Styles like <g id="2">bold</g> is supported.</g>

I use XmlSax to parse it, but I cannot get the first text part "Styles like " in the _EndElement event.

The result using xml2map:

B4X:

(MyMap) {g={Attributes={id=1}, g={Attributes={id=2}, Text=bold}}}

I tried parsing it with jsoup, which gives the right result:

B4X:

(TextNode)
Styles like
(Element) <g id="2">
bold
</g>
(TextNode)
bold
(TextNode)  is supported.

Is XmlSax wrapped correctly? I would prefer to use Xml2Map, which parses XML into a map.

xulihang · Aug 12, 2020

I checked the wrapper's source code. I think there should be a characters event to handle text.

Java:

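        // Called by the SAX parser for each chunk of character data.
        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException
        {
            String str = new String(ch, start, length);
            sb.append(str); // keep accumulating into the existing buffer
            ba.raiseEvent2(null, true, charactersEvent, false, str); // also raise the chunk right away
        }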
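
To illustrate why a characters callback matters for mixed content, here is a minimal standalone sketch using the JDK's own SAX parser. It is not the jXmlSax wrapper code; the class name `MixedContentDemo` and the `events` helper are hypothetical names chosen for this example. It buffers text and flushes it at every element boundary, so the text before a child element is not lost.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class MixedContentDemo {
    // Parses the XML and returns the events in document order (hypothetical helper).
    static List<String> events(String xml) throws Exception {
        List<String> out = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder pending = new StringBuilder();

            // SAX may deliver one text run in several chunks,
            // so text is buffered and emitted at element boundaries.
            private void flush() {
                if (pending.length() > 0) {
                    out.add("text:" + pending);
                    pending.setLength(0);
                }
            }

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                flush(); // text that precedes a child element is emitted here
                out.add("start:" + qName + "#" + atts.getValue("id"));
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                pending.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                flush();
                out.add("end:" + qName);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return out;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<g id=\"1\">Styles like <g id=\"2\">bold</g> is supported.</g>";
        for (String e : events(xml)) {
            System.out.println(e);
        }
    }
}
```

For the sample element, the list contains the text "Styles like " between the two start events and " is supported." between the two end events, which is the ordering that an EndElement-only text argument cannot express.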

Erel · Aug 12, 2020

The full string should be added. You are right that it will only be raised at the end.

xulihang · Aug 12, 2020

So how should this situation be handled?

I tried adding a Name key to store the element name and a children key to store the parts in sequence. But the resulting map cannot be converted back to XML using Map2Xml.

Map:

B4X:

(MyMap) {g={Attributes={id=1}, Name=g, children=[Styles like , {Attributes={id=2}, Name=g, children=bold},  is supported.]}}
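
A structure like the one above (a name, attributes, and an ordered list of children) can be written back to XML with a small recursive serializer. The following is only a sketch of that idea in Java, not Map2Xml and not code from this thread; `Node`, `attr`, `add`, `escape`, and `toXml` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OrderedNodeDemo {
    // Hypothetical node type: children holds String (text) or Node (element) in document order.
    static final class Node {
        final String name;
        final Map<String, String> attributes = new LinkedHashMap<>();
        final List<Object> children = new ArrayList<>();

        Node(String name) { this.name = name; }
        Node attr(String key, String value) { attributes.put(key, value); return this; }
        Node add(Object child) { children.add(child); return this; }
    }

    // Escapes the characters that are special in XML text and attribute values.
    static String escape(String s, boolean inAttribute) {
        String r = s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
        return inAttribute ? r.replace("\"", "&quot;") : r;
    }

    // Serializes a node and its children recursively, preserving their order.
    static String toXml(Node n) {
        StringBuilder sb = new StringBuilder("<").append(n.name);
        for (Map.Entry<String, String> a : n.attributes.entrySet()) {
            sb.append(' ').append(a.getKey()).append("=\"")
              .append(escape(a.getValue(), true)).append('"');
        }
        if (n.children.isEmpty()) {
            return sb.append("/>").toString();
        }
        sb.append('>');
        for (Object c : n.children) {
            sb.append(c instanceof Node ? toXml((Node) c) : escape(c.toString(), false));
        }
        return sb.append("</").append(n.name).append('>').toString();
    }

    public static void main(String[] args) {
        Node inner = new Node("g").attr("id", "2").add("bold");
        Node outer = new Node("g").attr("id", "1")
                .add("Styles like ").add(inner).add(" is supported.");
        System.out.println(toXml(outer));
    }
}
```

Because the children are kept in a list rather than as map keys, text and elements can alternate freely and the original order survives a round trip.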

Erel · Aug 12, 2020

This is indeed a document that is not suitable for XmlSax. Can you upload a full XML example?

xulihang · Aug 12, 2020

I have attached a reduced version of the full XML, which was generated by Okapi Tikal from a docx file.

Erel · Aug 12, 2020

1. Such XML cannot be represented with a map so it will never work with Xml2Map.

2. You can parse it with MiniHtmlParser. I don't know whether it will be simpler than jSoup.

B4X:

Dim parser As MiniHtmlParser
parser.Initialize
Dim root As HtmlNode = parser.Parse(File.ReadString("C:\Users\H\Downloads\321880_ms.docx.xlf", ""))
parser.PrintNode(root)

xulihang · Aug 12, 2020

MiniHtmlParser is a better solution. I will try to use it in my scenario.

An XML writer is still needed to convert a modified XML node back to a string.

Erel · Aug 12, 2020

It should be possible to write it with XmlBuilder.

xulihang · Aug 12, 2020

I found that after a text node is added, XMLBuilder no longer allows adding an element and throws an IllegalStateException (see: assertElementContainsNoOrWhitespaceOnlyTextNodes). I have posted an issue about this.

xulihang · Aug 13, 2020

I tried MiniHtmlParser and it cannot parse <source xml:lang="en"><g id="1">[Global Notes:</g></source> correctly.

I made a new XML parser based on a new jXmlSax.jar that raises a characters event.

I also wrapped javax.xml.parsers.DocumentBuilder to build XML.

Files are attached.
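
The attached wrapper is not reproduced here. As a rough sketch of the underlying approach, the plain JDK DOM API accepts text and element children in any order, and a `Transformer` can serialize the result. The class name `DomMixedContentDemo` and the `build` method are hypothetical names for this example.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class DomMixedContentDemo {
    // Builds the mixed-content sample element and returns it as a string.
    static String build() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();

        Element outer = doc.createElement("g");
        outer.setAttribute("id", "1");
        doc.appendChild(outer);

        // Text and element children are appended in document order.
        outer.appendChild(doc.createTextNode("Styles like "));
        Element inner = doc.createElement("g");
        inner.setAttribute("id", "2");
        inner.appendChild(doc.createTextNode("bold"));
        outer.appendChild(inner);
        outer.appendChild(doc.createTextNode(" is supported."));

        // Serialize without the XML declaration so only the element is returned.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter w = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(w));
        return w.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(build());
    }
}
```

Unlike the XMLBuilder behavior described above, DOM does not reject an element that follows a text node, which is why it suits inline markup such as the XLIFF g element.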

Erel · Aug 13, 2020

xulihang said:
I tried MiniHtmlParser and it cannot parse <source xml:lang="en"><g id="1">[Global Notes:</g></source> correctly.

Not sure that you are correct.

Tested with this:

B4X:

Dim root As HtmlNode = parser.Parse($"<source xml:lang="en"><g id="1">[Global Notes:</g></source>"$)
parser.PrintNode(root)

The output looks correct.

xulihang · Aug 13, 2020

B4X:

*** root ***
 *** source ***
 |lang: en|
 *** g ***
 |id: 1|
  *** text ***
  |value: [Global Notes:|

The result I got is like this.

Erel · Aug 13, 2020

Looks correct (the attribute xml:lang was renamed to lang but this is probably not the problem that you are talking about).

What is the problem?

xulihang · Aug 13, 2020

The source element should be the parent of the g element.

The result my XML builder produces from that node is: <root><source lang="en"/><g id="1">[Global Notes:</g></root>

Erel · Aug 13, 2020

I see. This is indeed not a good use case for MiniHtmlParser, as this is not valid HTML.

The source element must be empty: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/source

This code sets the elements that will be closed automatically:

B4X:

VoidTags = B4XCollections.CreateSet2(Array("!DOCTYPE", "area", "base", "br", "col", _
        "embed", "hr", "img", "input", "link", "meta", "param", "source", "track", "wbr"))

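The effect of a void-tag set can be shown with a toy tag scanner. This is not MiniHtmlParser's implementation, only a sketch under the assumption that a void tag is treated as closed as soon as it is opened; `VoidTagDemo` and `outline` are hypothetical names, and self-closing syntax, comments, and attributes containing ">" are not handled.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VoidTagDemo {
    // Returns one "depth:name" entry per opening tag, in document order.
    static List<String> outline(String markup, Set<String> voidTags) {
        List<String> out = new ArrayList<>();
        Deque<String> open = new ArrayDeque<>();
        Matcher m = Pattern.compile("<(/?)([A-Za-z][\\w:.-]*)[^>]*>").matcher(markup);
        while (m.find()) {
            String name = m.group(2);
            if (m.group(1).isEmpty()) {
                // Opening tag: record its depth, then keep it open unless it is a void tag.
                out.add(open.size() + ":" + name);
                if (!voidTags.contains(name)) {
                    open.push(name);
                }
            } else if (name.equals(open.peek())) {
                // Closing tag that matches the innermost open element.
                open.pop();
            }
            // Any other closing tag (for example </source> after auto-closing) is ignored.
        }
        return out;
    }

    public static void main(String[] args) {
        String markup = "<source xml:lang=\"en\"><g id=\"1\">[Global Notes:</g></source>";
        Set<String> htmlVoid = new HashSet<>(Arrays.asList("br", "img", "source"));
        System.out.println(outline(markup, htmlVoid));               // source treated as void
        System.out.println(outline(markup, Collections.emptySet())); // no void tags
    }
}
```

With "source" in the set, g is reported at depth 0 as a sibling of source, matching the flattened output seen earlier in the thread; with an empty set, g is reported at depth 1 as a child of source.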