B4A Library [B4X] MiniHtmlParser - simple html parser implemented with B4X

Erel · Jun 3, 2020

- 0.91 - Fixes an issue with text after the last element.

Erel · Jun 29, 2020

- 0.92 - Unescapes more entities, including entities written with the Unicode code point, e.g. &#8501; (ℵ)

Erel · Aug 11, 2020

- 0.93 - Fixes an issue with whitespace characters being removed too aggressively.

Erel · Oct 20, 2020

- 0.94 - New FindDirectNodes method. Returns a list with the direct child nodes that match the tag name and optionally the attribute.
New IsNodeMatches method. Tests whether the given node matches the tag name and optionally the attribute.
The example was updated. It was broken by the change in v0.93. It is now built using FindDirectNodes and is more robust than the previous implementation.

MathiasM · Mar 19, 2021

Hello Erel

I like this library very much, but I have a small question: what was the design intention behind HTMLParser.GetTextFromNode() instead of HTMLNode.Text?
The same goes for FindNode() and other similar calls.
The second form looks more 'B4X-like' than the method used now.

Thanks a lot for this library!

Erel · Mar 21, 2021

Technically, it is designed this way because HtmlNode is a user type rather than a class, so it cannot have methods. This is a bit faster than a full class, though it could have been designed differently.
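
To illustrate the point about user types, here is a minimal sketch of a class module. DemoNode and CountChildren are hypothetical names chosen for this example and are not part of MiniHtmlParser. The type only declares fields, so any behaviour has to live in a Sub of the surrounding class that receives the value as a parameter.

```b4x
Sub Class_Globals
    'A user type declares fields only; it cannot contain Subs.
    'DemoNode is a hypothetical type, not the library's HtmlNode.
    Type DemoNode (Name As String, Children As List)
End Sub

Public Sub Initialize
End Sub

'Behaviour lives in the class and receives the node as a parameter.
Public Sub CountChildren (Node As DemoNode) As Int
    If Node.Children.IsInitialized = False Then Return 0
    Return Node.Children.Size
End Sub
```

A caller would therefore write something like helper.CountChildren(node) rather than node.CountChildren, which mirrors the HtmlParser.GetTextFromNode style discussed above.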

aeric · Feb 5, 2025

Suggested update to ParseAttributes so that attribute names may contain dashes:

B4X:

Private Sub ParseAttributes (Parent As HtmlNode)
    Dim start As Int = Index
    ReadUntil(">")
    Dim s As String = mHtml.SubString2(start, Index - 1)
    For Each EscapeChar As String In Array("'", $"""$)
        'allow attribute names to contain dashes (-), e.g. data-value or aria-label
        'Dim m As Matcher = Regex.Matcher($"(\w+)\s*=\s*\${EscapeChar}([^${EscapeChar}]+)\${EscapeChar}"$, s)
        Dim m As Matcher = Regex.Matcher($"([a-zA-Z0-9-]+)\s*=\s*\${EscapeChar}([^${EscapeChar}]+)\${EscapeChar}"$, s)
        Do While m.Find
            Parent.Attributes.Add(CreateHtmlAttribute(m.Group(1), m.Group(2)))
        Loop
    Next
End Sub

Erel · Feb 6, 2025

- 0.95 - Fixes an issue with attribute keys containing dashes. Thank you @aeric for the fix!

aeric · Nov 3, 2025

Some attributes have no value. These are called boolean attributes, e.g. disabled and required.
The current parser ignores these attributes.
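
A small sketch of why this happens, assuming the key/value pattern from the 0.95 fix. FindValuedKeys and DemoBooleanAttributeIsSkipped are hypothetical helpers written for this illustration, and the sample text uses single quotes only to keep the string literal simple. The pattern requires an equals sign and a quoted value, so a bare name never matches.

```b4x
'Hypothetical helper, not part of MiniHtmlParser: returns the keys that the
'key='value' pattern finds in an attribute string.
Public Sub FindValuedKeys (AttributeText As String) As List
    Dim keys As List
    keys.Initialize
    Dim m As Matcher = Regex.Matcher("([a-zA-Z0-9-]+)\s*=\s*'([^']+)'", AttributeText)
    Do While m.Find
        keys.Add(m.Group(1))
    Loop
    Return keys
End Sub

Public Sub DemoBooleanAttributeIsSkipped
    Dim keys As List = FindValuedKeys("type='text' id='username' required")
    'The list contains type and id; required is absent because it has no "=".
    Log(keys)
End Sub
```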

With help from DeepSeek, I have made the following modification to ParseAttributes and so far it is working fine for my MiniHtml library.

B4X:

Private Sub ParseAttributes (Parent As HtmlNode)
    Dim start As Int = Index
    ReadUntil(">")
    Dim s As String = mHtml.SubString2(start, Index - 1)
    ' Parse attributes with values (key="value" or key='value')
    For Each EscapeChar As String In Array("'", QUOTE)
        Dim m As Matcher = Regex.Matcher($"([a-zA-Z0-9-]+)\s*=\s*\${EscapeChar}([^${EscapeChar}]+)\${EscapeChar}"$, s)
        Do While m.Find
            Parent.Attributes.Add(CreateHtmlAttribute(m.Group(1), m.Group(2)))
        Loop
    Next
    
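    ' Parse boolean attributes (standalone keys like "disabled", "selected", "required").
    ' The lookahead only accepts a name that is followed (after optional whitespace) by "/", ">"
    ' or the end of the attribute text, so a boolean attribute placed before other attributes is not detected.
    Dim m As Matcher = Regex.Matcher($"\b([a-zA-Z0-9-]+)(?=\s*[/>]|\s*$)"$, s)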
    Do While m.Find
        Dim attrName As String = m.Group(1)
        ' Skip if this is part of a key=value pair (already processed above)
        Dim isAlreadyProcessed As Boolean = False
        For Each existingAttr As HtmlAttribute In Parent.Attributes
            If existingAttr.Key = attrName Then
                isAlreadyProcessed = True
                Exit
            End If
        Next
        
        ' The regex can also match stray words, so only accept names from a
        ' whitelist of common HTML boolean attributes.
        If isAlreadyProcessed = False Then
            Dim commonBooleanAttrs As List = Array As String("disabled", "readonly", "checked", "required", "selected", "multiple", "autofocus", "novalidate", "formnovalidate", "hidden")

            If commonBooleanAttrs.IndexOf(attrName) > -1 Then
                Parent.Attributes.Add(CreateHtmlAttribute(attrName, attrName))
            Else
                ' Log unexpected boolean attributes for debugging
                Log($"Warning: Unexpected boolean attribute: ${attrName}"$)
            End If
        End If
    Loop
End Sub

Here is my simple test:

B4X:

Sub Test
    Dim s As String = $"<form action="" method="post">
    <label for="username">Username:</label><br>
    <input type="text" id="username" name="username" class="c11 c12 c13" style="s11:v11; s12:v12;" required><br>
    <br>
    <label for="country">Country:</label><br>
    <select id="country" name="country" class="c21 c22 c23" style="s21: v21; s22: v22">
        <option disabled>Select a country</option>
        <option value="malaysia">Malaysia</option>
        <option value="other">Other</option>
    </select><br>
    <br>
    <input type="submit" value="Submit">
</form>"$
    HtmlParser.Initialize
    Dim root As HtmlNode = HtmlParser.Parse(s)
    If root.IsInitialized Then
        Dim root1 As HtmlNode = root.Children.Get(0)
        Log($"root: ${root1.name}"$)
        For Each attr As HtmlAttribute In root1.Attributes
            Log($"${attr.Key} = ${attr.Value}"$)
        Next
        For Each child As HtmlNode In root1.Children
            Log($"  child: ${child.name}"$)
            For Each attr1 As HtmlAttribute In child.Attributes
                If attr1.Key = attr1.Value Then
                    LogColor($"  ${attr1.Key} = ${attr1.Value}"$, -65536)
                Else
                    Log($"  ${attr1.Key} = ${attr1.Value}"$)
                End If
            Next
            For Each grandchild As HtmlNode In child.Children
                LogColor($"    child: ${grandchild.name}"$, -16776961)
                For Each attr2 As HtmlAttribute In grandchild.Attributes
                    If attr2.Key = attr2.Value Then
                        LogColor($"    ${attr2.Key} = ${attr2.Value}"$, -65536)
                    Else
                        LogColor($"    ${attr2.Key} = ${attr2.Value}"$, -16776961)
                    End If
                Next
                For Each grand2child As HtmlNode In grandchild.Children
                    LogColor($"      child: ${grand2child.name}"$, -65281)
                    For Each attr3 As HtmlAttribute In grand2child.Attributes
                        LogColor($"      ${attr3.Key} = ${attr3.Value}"$, -65281)
                    Next
                Next
            Next
        Next
    End If
End Sub
