B4A Library [B4X] MiniHtmlParser - simple html parser implemented with B4X

MiniHtmlParser is a cross-platform class that parses HTML strings and builds a tree of the elements it finds.
It is less powerful than jTidy or jSoup; however, it is simple to use, cross-platform and, as it is implemented in B4X, quite easy to extend.
Note that many real-world HTML pages are not 100% valid. The parser tries to handle a few such cases, but it is far less forgiving than browsers, which recover from many common HTML problems.

The example demonstrates how to use the parser.
It parses the HTML saved from https://www.x-rates.com/table/?from=USD&amount=1 and finds the rates of the top 10 currencies. This is only done as an example.

Depends on B4XCollections.


Updates:

- 0.95 - Fixes an issue with attribute keys containing dashes. Thank you @aeric for the fix!
- 0.94 - New FindDirectNodes method. Returns a list with the direct child nodes that match the tag name and optionally the attribute.
New IsNodeMatches methods - tests whether the given node matches the tag name and optionally the attribute.
Example was updated. It was broken by the change in v0.93. It is now built using FindDirectNodes and is more robust than the previous implementation.

- 0.93 - Fixes an issue with whitespace characters being removed too aggressively.
- 0.92 - Unescapes more entities, including entities written with the Unicode code point, e.g. &#8501; (ℵ)
- 0.91 - Fixes an issue with text after the last element.
 

Attachments

  • Example.zip (8.9 KB)
  • MiniHtmlParser.b4xlib (2.8 KB)

Erel

B4X founder

- 0.94 - New FindDirectNodes method. Returns a list with the direct child nodes that match the tag name and optionally the attribute.
New IsNodeMatches methods - tests whether the given node matches the tag name and optionally the attribute.
Example was updated. It was broken by the change in v0.93. It is now built using FindDirectNodes and is more robust than the previous implementation.
 

MathiasM

Hello Erel

I like this library very much, but I have a small question: what was the design intention behind HTMLParser.GetTextFromNode() rather than HTMLNode.Text?
The same goes for FindNode() and other similar calls.
The second form looks more 'B4X-like' than the method used now.

Thanks a lot for this library!
 

aeric

Suggested update:
B4X:
Private Sub ParseAttributes (Parent As HtmlNode)
    Dim start As Int = Index
    ReadUntil(">")
    Dim s As String = mHtml.SubString2(start, Index - 1)
    For Each EscapeChar As String In Array("'", $"""$)
        'Allow attribute names to contain dashes (-), e.g. data-value or aria-label.
        'Previous pattern, kept for reference (\w does not match "-"):
        'Dim m As Matcher = Regex.Matcher($"(\w+)\s*=\s*\${EscapeChar}([^${EscapeChar}]+)\${EscapeChar}"$, s)
        Dim m As Matcher = Regex.Matcher($"([a-zA-Z0-9-]+)\s*=\s*\${EscapeChar}([^${EscapeChar}]+)\${EscapeChar}"$, s)
        Do While m.Find
            Parent.Attributes.Add(CreateHtmlAttribute(m.Group(1), m.Group(2)))
        Loop
    Next
End Sub
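
To see why the key pattern had to change, here is a small stand-alone sketch. CompareKeyPatterns and LogKeys are hypothetical demo subs, not part of the library; they only use the built-in Regex and Matcher types and run both key patterns against the same tag text.

```b4x
'Hypothetical demo subs, not part of MiniHtmlParser.
Sub CompareKeyPatterns
    Dim s As String = $"<div data-value="5" aria-label="Close">"$
    'Old key pattern: \w does not match "-", so only the part after the dash is captured.
    'Logs value, then label.
    LogKeys($"(\w+)\s*=\s*${QUOTE}([^${QUOTE}]+)${QUOTE}"$, s)
    'New key pattern: dashes are allowed in the key.
    'Logs data-value, then aria-label.
    LogKeys($"([a-zA-Z0-9-]+)\s*=\s*${QUOTE}([^${QUOTE}]+)${QUOTE}"$, s)
End Sub

'Logs the first capture group (the attribute key) of every match.
Private Sub LogKeys (Pattern As String, Text As String)
    Dim m As Matcher = Regex.Matcher(Pattern, Text)
    Do While m.Find
        Log(m.Group(1))
    Loop
End Sub
```

With the old pattern the attributes are still found, but under the wrong keys, which is easy to miss when testing.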
 

aeric

Some attributes have no value. These are called boolean attributes, e.g. disabled and required.
The current parser ignores these attributes.

With help from DeepSeek, I have made the following modification to ParseAttributes, and so far it is working fine for my MiniHtml library.
B4X:
Private Sub ParseAttributes (Parent As HtmlNode)
    Dim start As Int = Index
    ReadUntil(">")
    Dim s As String = mHtml.SubString2(start, Index - 1)
    ' Parse attributes with values (key="value" or key='value')
    For Each EscapeChar As String In Array("'", QUOTE)
        Dim m As Matcher = Regex.Matcher($"([a-zA-Z0-9-]+)\s*=\s*\${EscapeChar}([^${EscapeChar}]+)\${EscapeChar}"$, s)
        Do While m.Find
            Parent.Attributes.Add(CreateHtmlAttribute(m.Group(1), m.Group(2)))
        Loop
    Next
    
    ' Parse boolean attributes (standalone keys such as "disabled", "selected", "required").
    ' The pattern matches a bare name directly followed by "/", ">" or the end of the attribute text.
    ' Only names from this list of common HTML boolean attributes are accepted.
    Dim commonBooleanAttrs As List = Array As String("disabled", "readonly", "checked", "required", "selected", "multiple", "autofocus", "novalidate", "formnovalidate", "hidden")
    Dim m As Matcher = Regex.Matcher($"\b([a-zA-Z0-9-]+)(?=\s*[/>]|\s*$)"$, s)
    Do While m.Find
        Dim attrName As String = m.Group(1)
        ' Skip names that were already added as key=value pairs above
        Dim isAlreadyProcessed As Boolean = False
        For Each existingAttr As HtmlAttribute In Parent.Attributes
            If existingAttr.Key = attrName Then
                isAlreadyProcessed = True
                Exit
            End If
        Next

        If isAlreadyProcessed = False Then
            If commonBooleanAttrs.IndexOf(attrName) > -1 Then
                ' Store the name as the value too, as in disabled="disabled"
                Parent.Attributes.Add(CreateHtmlAttribute(attrName, attrName))
            Else
                ' Log unexpected bare names for debugging
                Log($"Warning: Unexpected boolean attribute: ${attrName}"$)
            End If
        End If
    Loop
End Sub
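
One thing worth checking in the boolean-attribute pattern: the lookahead (?=\s*[/>]|\s*$) only accepts a name that is directly followed by "/", ">" or the end of the attribute text. As far as I can tell, a bare attribute placed before other attributes, as in <input required id="x">, is therefore not matched. A possible alternative, sketched here under that assumption, removes the quoted key=value pairs first and then treats every remaining name as a bare attribute. FindBareAttributeNames is a hypothetical helper, not part of MiniHtmlParser, and it has not been tested against the library itself.

```b4x
'Hypothetical helper, not part of MiniHtmlParser.
'TagText is the attribute text of a single tag, e.g.  type="text" required id="x"
Sub FindBareAttributeNames (TagText As String) As List
    Dim result As List
    result.Initialize
    Dim rest As String = TagText
    'Remove quoted key=value pairs so that words inside values are never inspected.
    '[^q]* (instead of +) also removes pairs with an empty value such as action="".
    For Each q As String In Array("'", QUOTE)
        rest = Regex.Replace($"[a-zA-Z0-9-]+\s*=\s*${q}[^${q}]*${q}"$, rest, " ")
    Next
    'Every name that is left has no quoted value.
    Dim m As Matcher = Regex.Matcher("[a-zA-Z0-9-]+", rest)
    Do While m.Find
        result.Add(m.Match)
    Loop
    Return result
End Sub
```

For the attribute text  type="text" required id="x"  this returns a list containing only required. Unquoted values such as type=text are not handled by this sketch, so filtering the result against a list of known boolean attributes is still a sensible safeguard.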

Here is my simple test:
B4X:
Sub Test
    Dim s As String = $"<form action="" method="post">
    <label for="username">Username:</label><br>
    <input type="text" id="username" name="username" class="c11 c12 c13" style="s11:v11; s12:v12;" required><br>
    <br>
    <label for="country">Country:</label><br>
    <select id="country" name="country" class="c21 c22 c23" style="s21: v21; s22: v22">
        <option disabled>Select a country</option>
        <option value="malaysia">Malaysia</option>
        <option value="other">Other</option>
    </select><br>
    <br>
    <input type="submit" value="Submit">
</form>"$
    HtmlParser.Initialize
    Dim root As HtmlNode = HtmlParser.Parse(s)
    If root.IsInitialized Then
        Dim root1 As HtmlNode = root.Children.Get(0)
        Log($"root: ${root1.name}"$)
        For Each attr As HtmlAttribute In root1.Attributes
            Log($"${attr.Key} = ${attr.Value}"$)
        Next
        For Each child As HtmlNode In root1.Children
            Log($"  child: ${child.name}"$)
            For Each attr1 As HtmlAttribute In child.Attributes
                'Boolean attributes are stored with Value = Key; log them in red (-65536)
                If attr1.Key = attr1.Value Then
                    LogColor($"  ${attr1.Key} = ${attr1.Value}"$, -65536)
                Else
                    Log($"  ${attr1.Key} = ${attr1.Value}"$)
                End If
            Next
            For Each grandchild As HtmlNode In child.Children
                LogColor($"    child: ${grandchild.name}"$, -16776961)
                For Each attr2 As HtmlAttribute In grandchild.Attributes
                    If attr2.Key = attr2.Value Then
                        LogColor($"    ${attr2.Key} = ${attr2.Value}"$, -65536)
                    Else
                        LogColor($"    ${attr2.Key} = ${attr2.Value}"$, -16776961)
                    End If
                Next
                For Each grand2child As HtmlNode In grandchild.Children
                    LogColor($"      child: ${grand2child.name}"$, -65281)
                    For Each attr3 As HtmlAttribute In grand2child.Attributes
                        LogColor($"      ${attr3.Key} = ${attr3.Value}"$, -65281)
                    Next
                Next
            Next
        Next
    End If
End Sub
