Android Question How to clean invalid HTML to process with DOM parser?

bububrln

Member
Licensed User
Hello everyone,

I am a brand-new user and dare to ask my first question here. Please bear with me, as I am a total beginner with all things Android and B4A. :) I am trying to find my way into B4A by trying a few things out. Among them is this: I want to download an HTML file from the internet and process its contents within my app. The particular website I am interested in is a nightmare, code-wise: while it may display in a browser, it is still invalid HTML with a lot of boilerplate JavaScript, unbalanced tags and what have you... So, quite an interesting task to deal with, haha. I want to be able to repeat the process whenever the website changes, i.e. fixing the document's problems manually to make it parse is not an option.

So far, I have managed to download the HTML file at runtime using OkHttpUtils2, which works fine (I followed the thread about OkHttpUtils2 with Wait For). Now I need to parse it; I want to use data in a table deeply buried in the HTML. Ideally, I would like to just use an XPath to point to the nodes whose information I want, but I could not find a good starting point for that. So I decided to try a DOM parser rather than SAX, using the promising XOM library (specification). But since the source document is far from being XHTML, I tried to make it parseable using jTidy. Unfortunately, jTidy creates an empty file, and the unfiltered log says, among many other things (translated by me, because jTidy's log is in German on my machine): "The content of the document looks like HTML 4.01 Transitional. There have been 84 warnings and 11 errors! This document has errors that need to be corrected before HTML Tidy can clean it up." So, no luck here, given that manually fixing things beforehand is not an option if the process is to stay repeatable when the website updates at runtime.

So, I tried using the jSoup Parser. Amazingly, it parses the unpreprocessed, faulty HTML file and lets me do this for instance:

B4X:
Dim html As String = LoadHtmlFromDisk(File.DirAssets, "the-document.html")
Dim js As jSoup
Dim tablerows As List
tablerows.Initialize
tablerows = js.getElementsByTag(html, "tr")
This actually gives me all the <tr> tags including their respective <td> columns, with each <tr> being a separate list item in tablerows, like this:
tablerows.Get(0) would for instance return:
<tr>
    <td style="text-align: center">some text</td>
    <td style="text-align: center">other text</td>
    <td style="text-align: center">third column text</td>
    <!-- ... etc ... -->
</tr>
Well, that's a start... But trying to get the text out of each td node, I fail:
B4X:
For i = 0 To tablerows.Size -1
    Dim columns As List : columns.Initialize
    columns = js.selectorElementText(tablerows.Get(i), "td") ' <-- this comes up empty
       
    For j = 0 To columns.Size -1
        Log($"${i}-${j}: ${columns.Get(j)}"$)
    Next
Next
Line #3 (the js.selectorElementText call) results in an empty list.

So, I have (at least 😂) two questions I guess:
  1. How can I correctly get the text node out of an element node (i.e. the text inside a tag) using jSoup?

  2. Since I would like to use XOM's types to process everything: How can I have jSoup output a cleaned document (which I could then possibly feed jTidy and finally XOM)? I think I would want to use jSoup's "clean_HTML" method, but I can't figure out how to use it... It seems I'd want to use "relaxed" as whitelist level, but I don't know how to parametrize this.
Thank you, everyone, for reading this and possibly even giving me a hint :)
 


bububrln

Member
Licensed User
Thank you, Erel, for your reply. I have updated my post above by attaching the unprocessed HTML source to it as a text file. This is how I downloaded it with OkHttpUtils2.
 

William Lancee

Active Member
Licensed User
If Erel is working on this, his response will be more elegant. But if you're new to this, the following will show some features of B4X that I like.

B4X:
    Dim aRow As String = $"
<tr>
<td style="text-align: center">some text</td>
<td style="text-align: center">other text</td>
<td style="text-align: center">third column text</td>
<!-- ... etc ... -->
</tr>
"$

    Dim theFields() As String = Regex.Split("<td", aRow)
    Log(theFields.Length)    'this will be 4, the first is: <tr>, so ignore
    Dim fieldList As List: fieldList.Initialize
    For i = 1 To theFields.Length - 1
        Dim result As String = theFields(i)
        Log(result)            'the first of these will be: style="text-align: center">some text</td>
        result = result.SubString2(result.IndexOf(">") + 1, result.IndexOf("<"))
        Log(result)             'the first of these will be: some text
        fieldList.Add(result)
    Next
    
    For Each s As String In fieldList
        Log(s)
    Next
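
The same extraction can also be sketched with Regex.Matcher, which avoids the manual index arithmetic. This is only a minimal sketch: the sub name ExtractCellTexts is made up for illustration, and the pattern assumes each cell holds plain text without nested tags.

```b4x
' Hypothetical helper (name chosen for illustration only).
' Collects the text of every <td>...</td> in a row of HTML.
' Assumption: cells contain plain text, no nested tags.
Sub ExtractCellTexts(rowHtml As String) As List
    Dim cells As List
    cells.Initialize
    ' Group 1 captures everything between <td ...> and </td>
    Dim m As Matcher = Regex.Matcher("<td[^>]*>([^<]*)</td>", rowHtml)
    Do While m.Find
        cells.Add(m.Group(1).Trim)
    Loop
    Return cells
End Sub
```

Called as ExtractCellTexts(aRow) with the sample row above, this should return a list with the three cell texts. Like any regex approach, it breaks as soon as the markup gets messier, so it is not a substitute for a real parser.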
 

OliverA

Expert
Licensed User
It looks like when extracting parts of a table, the whole table has to be there. So change this
B4X:
    columns = js.selectorElementText(tablerows.Get(i), "td") ' <-- this comes up empty
to this
B4X:
   columns = js.selectorElementText($"<table>${tablerows.Get(i)}</table>"$, "td") ' <-- works
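
A likely explanation, assuming the wrapper parses its input as a complete HTML document: under the HTML parsing rules that jsoup follows, <tr> and <td> start tags that appear outside of a table are ignored, so a bare row loses its elements and only the text survives. Wrapping the row in <table> keeps them. The workaround could be put into a small helper; the sub name GetCellTexts is hypothetical, and selectorElementText is used exactly as in the snippets above.

```b4x
' Hypothetical helper (not part of the jSoup library).
' Wraps a bare <tr> fragment in <table> so the parser keeps the
' <tr>/<td> elements, then returns the text of each cell.
Sub GetCellTexts(js As jSoup, rowHtml As String) As List
    Dim wrapped As String = $"<table>${rowHtml}</table>"$
    Return js.selectorElementText(wrapped, "td")
End Sub
```

Inside the loop this would be called as columns = GetCellTexts(js, tablerows.Get(i)).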
 

bububrln

Member
Licensed User
William Lancee said:
If Erel is working on this, his response will be more elegant. But if you're new to this, the following will show some features of B4X that I like.

B4X:
   ...
Thank you for your effort and for showing me how it can be done. :) I had the same idea earlier, but, besides the fact that I am not yet skilled enough with B4A to do so, I didn't want to go down that road yet, since I am hoping to find a generalizable way to parse any, or at least most, ugly/invalid HTML documents with a DOM parser.
 

bububrln

Member
Licensed User
OliverA said:
It looks like when extracting parts of a table, the whole table has to be there. So change this
B4X:
    columns = js.selectorElementText(tablerows.Get(i), "td") ' <-- this comes up empty
to this
B4X:
   columns = js.selectorElementText($"<table>${tablerows.Get(i)}</table>"$, "td") ' <-- works
While I can't grasp why this would be necessary (since a balanced and well-formed subtree like <tr><td>abcd</td></tr> should be enough to read its text nodes), your solution does solve the problem at hand! So wow, thanks a lot, this works indeed :)
 