Android Question Parsing web pages (the DOM)

Albert Kallal · Mar 19, 2021

So, last night, I came across this post:
Http Download a CSV or Json File for Android OS Versions | B4X Programming Forum

So the poster goes to a web page with the Android versions, sees a new one, edits a csv.txt file by hand, and then his cute little program can display a list of Android versions. All very simple.

So, a quick google for android versions. We get this page:

Android version history - Wikipedia

.... and so on. Not a large table.

So, I thought, why don't I pull that table out of the web page DOM? I have ALWAYS considered a web page a DOM, and more so what we call XAML, or so-called "zammel".

As a general rule, these pages tend to be created with developer tools - not hand coded anymore. As a result, they are NOTHING more than a simple set of start tags and end tags (in effect an xml cube - nothing more, nothing less). So I can take, say, a 10 year old version of the MSXML library, point it at that page, and BOOM! - I have my xml cube all nice and ready.

So, OK, this should be "easy" - after all, we are talking about the web - that supposed "fad"? ;-)

OK, so I looked around for a DOM library for B4A - most posts pushed towards using jTidy.

So, I have this:

B4X:

Sub ScrapePage2

    ProgressBar1.Visible = True
    Sleep(0)    ' show progress bar
    Dim j As HttpJob
    j.Initialize("j",Me)
    j.Download("https://en.wikipedia.org/wiki/Android_version_history")
    Wait For (j) JobDone(j As HttpJob)
        
    If j.Success = False Then
        ProgressBar1.Visible = False
        Return
    End If
    
    Dim MyTidy As Tidy
    MyTidy.Initialize
    Log("convert and write as xml")
    MyTidy.Parse(j.GetInputStream,File.DirInternal,"web1.xml")

    Log("done - writing xml file")
    Dim strXML As String = File.ReadString(File.DirInternal,"web1.xml")
    Log("got xml - length = " & strXML.Length)
    
    If strXML.Length = 0 Then
        Return
    End If

    Dim x As Xml2Map
    x.Initialize
    Dim MyDom As Map
    
    Log("parsing")
    MyDom = x.Parse(strXML)
    Log("done parse")
    
    For Each skey As String In MyDom.Keys
        Log(skey)
    Next

    j.Release    ' free the job's resources once we are done with it
End Sub

But jTidy shows this error:

line 3,600 column 1 - Error: <nav> is not recognized!

Huh? Nav? Ok, count to 10 - don't make a rant!!! (the community here is too kind for a rant!!!).

But, gee??? "<nav>"? That's everywhere!!!

This VERY web page I am reading looks to have a bootstrap menu - and has this:

B4X:

<nav class="p-nav">
        <div class="p-nav-inner">
            <a class="p-nav-menuTrigger" data-xf-click="off-canvas" data-menu=".js-headerOffCanvasMenu" role="button" tabindex="0">
                <i aria-hidden="true"></i>
                <span class="p-nav-menuText">Menu</span>
            </a>
.....

So, my view is that jTidy should not care about which kinds of tags it supports, but ONLY care that it has a <start></start> pair (start + end tag). So this is about the closest you will see to a rant by me!

Why should jTidy even care about some start/end tag it doesn't know about?

Now, this "idea" only started since I thought, why not answer that poster's question - post a few lines of code that grab that "table" from that page, and off we go.

Helping that poster aside? I'm having difficulty parsing out that web page. I don't think I should be (but then again - that's my camp and my shortcoming here).
And to be fair - while the technology to do these things is mature right now, it is still a coding task, and I should give this problem more respect than I am!

Now, I did give SaxParser a go at chewing on the page. No surprise - it ate the whole page without a problem.

So, this bit does get me the elements (but posts here suggest not doing this, and besides, it doesn't really give me much of an object model to work with).

B4X:

    ' Dim MySax As SaxParser moved to global definition
    MySax.Initialize
    MySax.Parse(j.GetInputStream, "MySax")

And then this:

B4X:

Sub MySax_EndElement(Uri As String, Divname As String, strText As Object)

    If Divname = "td" Or Divname = "table" Then
        txtTableRaw.Text = txtTableRaw.Text & CRLF & strText
        Log(strText)
    End If
End Sub

The above - just picks apart the page - just works - no surprise.

My libraries manager shows jTidy 1.1 - perhaps a newer version is floating around?

I can't say this is high priority and holding me back right now.

But I will say that this ability to grab a DOM from a web page? Yes, that task will come up sooner or later in my Android travels.

Suggestions here - what road to take for this?

Regards,
Albert D. Kallal
Edmonton, Alberta Canada

Erel · Mar 21, 2021

1. jTidy is a thin wrapper above Tidy library.
2. I don't recommend using SaxParser directly. In most cases it will be much simpler to use Xml2Map which is based on SaxParser.
3. There is an HTML parser implemented in B4X named MiniHtmlParser. I recommend using this one.

B4X:

Sub Parse
    Dim j As HttpJob
    j.Initialize("",Me)
    j.Download("https://en.wikipedia.org/wiki/Android_version_history")
    Wait For (j) JobDone(j As HttpJob)
    If j.Success Then
        Dim parser As MiniHtmlParser
        parser.Initialize
        Dim root As HtmlNode = parser.Parse(j.GetString)
        Dim table As HtmlNode = parser.FindNode(root, "table", parser.CreateHtmlAttribute("class", "wikitable"))
'        parser.PrintNode(table)
        Dim tbody As HtmlNode = parser.FindNode(table, "tbody", Null)
        For Each tr As HtmlNode In parser.FindDirectNodes(tbody, "tr", Null)
            Dim counter As Int
            For Each td As HtmlNode In parser.FindDirectNodes(tr, "td", Null)
                counter = counter + 1
                If counter = 1 Then
                    Dim a As HtmlNode = td.Children.Get(0)
                    If a.Name = "a" Then
                        Dim title As HtmlAttribute = a.Attributes.Get(1)
                        Log("*********************")
                        Log(title.Value)
                        Continue
                    End If
                End If
                Log(parser.GetTextFromNode(td, 0))
            Next
        Next
    End If
    j.Release
End Sub

No official codename
1.0
September 23, 2008
No
1
cite_ref-unofficial_and_official_codenames_9-1
1.1
February 9, 2009
No
2
cite_ref-unofficial_and_official_codenames_9-2
*********************
Android Cupcake
1.5
April 27, 2009
No
3
cite_ref-:0_14-2
*********************
Android Donut
1.6
September 15, 2009
No
4
cite_ref-:0_14-3
*********************
Android Eclair
2.0 – 2.1
October 26, 2009
No
5 – 7
cite_ref-:0_14-4
*********************
Android Froyo
2.2 – 2.2.3
May 20, 2010
No
8
cite_ref-:0_14-5
*********************
Android Gingerbread
2.3 – 2.3.7
December 6, 2010
No
9 – 10
cite_ref-:0_14-6
*********************
Android Honeycomb
3.0 – 3.2.6
February 22, 2011
No
11 – 13
cite_ref-:0_14-7
*********************
Android Ice Cream Sandwich
4.0 – 4.0.4
October 18, 2011
No
14 – 15
cite_ref-:0_14-8
*********************
Android Jelly Bean
4.1 – 4.3.1
July 9, 2012
No
16 – 18
cite_ref-:0_14-9
*********************
Android KitKat
4.4 – 4.4.4
October 31, 2013
No
19 – 20
cite_ref-:0_14-10
*********************
Android Lollipop
5.0 – 5.1.1
November 12, 2014
No
21 – 22
cite_ref-:0_14-11
*********************
Android Marshmallow
6.0 – 6.0.1
October 5, 2015
No
23
cite_ref-:0_14-12
*********************
Android Nougat
7.0 – 7.1.2
August 22, 2016
No
24 – 25
cite_ref-:0_14-13
*********************
Android Oreo
8.0
August 21, 2017
No
26
cite_ref-:0_14-14
8.1
December 5, 2017
Yes
27
cite_ref-:0_14-15
*********************
Android Pie
9
August 6, 2018
Yes
28
cite_ref-:0_14-16
*********************
Android 10
10
September 3, 2019
Yes
29
cite_ref-:0_14-17
*********************
Android 11
11
September 8, 2020
Yes
30
cite_ref-:0_14-18
*********************
Android 12
12
TBA
Presupported
31
cite_ref-:0_14-19
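
One possible refinement of the code above, offered as a sketch and not as part of the original reply: a.Attributes.Get(1) depends on the order in which the attributes appear in the markup, so looking the attribute up by name is less fragile. AttributeByName below is a hypothetical helper, and it assumes that HtmlAttribute exposes Key and Value fields (Value is used above; Key is an assumption based on CreateHtmlAttribute taking a key and a value).

B4X:

' Hypothetical helper, not part of MiniHtmlParser.
' Returns the value of the attribute named Key, or DefaultValue if the node has no such attribute.
' Assumes HtmlAttribute has Key and Value fields.
Sub AttributeByName (Node As HtmlNode, Key As String, DefaultValue As String) As String
    For Each attr As HtmlAttribute In Node.Attributes
        If attr.Key = Key Then Return attr.Value
    Next
    Return DefaultValue
End Sub

With that in place, the title lookup could read Log(AttributeByName(a, "title", "(no title)")) and would keep working if the attribute order on the page changes.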

Albert Kallal · Mar 23, 2021

As always - much appreciated. (and having you stepping in? Well, ok - that's oh too kind).

I actually spent time on the weekend working at this.

Lots of choices! - There are "so many" choices here! I looked at quite a few options. SaxParser is actually quite nice - but too much of a 'scalpel' for this stuff.

And again that just shows how great B4A is. While SaxParser is being deprecated, the same library for Android has an xml version.

Regardless? TONS of choices - probably too many! - So, I wanted some good "advice" on which library to pick!

Where was I going with this? Well, as I noted, I don't really view web pages as web pages! They are blocks of tags.

So, want a list of Windows versions?
Ok then this:

So, want a list of Android versions? ok, then this:

So, on start - I list out the tables found on that site
eg;

So, the idea is that we often find a list of "some thing" on a web site. So, grab that table, say export it as Excel, and you have that table!

So, this concept has a million uses. Being able to hit a site with ease, list out the tables on that site, and then pick one?

Well, then you don't really think of this as web scraping so much as "show me the tables on that page, pick one, and display it"!
And adding, say, an "Email as Excel (csv)" option makes such a little idea rather useful - since now nearly any data or list you find on the web?
Just grab it - it's yours as a table!!! And right into Excel!
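
The "export as csv" step could be sketched roughly like this. It is only an illustration: RowsToCsv is a hypothetical name, and it assumes the table has already been collected as a List of rows, where each row is a List of Strings. It is not code from the app described in this post.

B4X:

' Hypothetical helper: builds CSV text from a List of rows (each row is a List of Strings).
' Every field is quoted, and embedded quotes are doubled, so commas inside cells are safe.
Sub RowsToCsv (Rows As List) As String
    Dim sb As StringBuilder
    sb.Initialize
    For Each row As List In Rows
        For i = 0 To row.Size - 1
            If i > 0 Then sb.Append(",")
            Dim cell As String = row.Get(i)
            sb.Append(QUOTE).Append(cell.Replace(QUOTE, QUOTE & QUOTE)).Append(QUOTE)
        Next
        sb.Append(CRLF)    ' in B4X, CRLF is the line feed character
    Next
    Return sb.ToString
End Sub

The result could then be saved with something like File.WriteString(File.DirInternal, "table.csv", RowsToCsv(Rows)) before attaching it to an email.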

Once again, B4A comes through with flying colors.

Once again, thank you so kindly.

This effort needs some clean up and love + care. But, I am happy to post this small app here when I have it working a bit better. And I'm thinking of sending results to B4XTable - as that allows even more ideas!

Regards,
Albert D. Kallal
Edmonton, Alberta Canada

Erel · Mar 23, 2021

SaxParser is not deprecated. Xml2Map is based on SaxParser and solves most of the challenges related to XML parsing.

Parsing HTML should be done with an HTML parser, not an XML parser.

Albert Kallal · Mar 23, 2021

SaxParser is not deprecated

Ah, ok - it was my reading on the Android dev site - it stated that sax1 was replaced with a newer version (not that it has been deprecated - my bad, I was wrong).

While it is not an HTML parser per se, it works very well against web pages - on my quest, I found it does a good job of, say, building a list of "tables" from a given web site.
eg this:

B4X:

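Sub MySax_StartElement(Uri As String, Divname As String, strText As Attributes)

    If Divname = "table" Then
        TableLevel = TableLevel + 1
        ' Use the first attribute value as the table's label.
        ' A <table> with no attributes would make GetValue(0) fail, so check Size first.
        Dim s As String = ""
        If strText.Size > 0 Then s = strText.GetValue(0)
        TableList.Add(s)
    End If

End Sub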

(I used the Dim s As String to get better casting - how to do that better is another question for another day!)

I found the above worked rather well to get that "list" of tables from that page (better than jTidy and some other libraries I was attempting to use).
It is quite tolerant of a mix of web markup and XML - and seems not at all limited to just xml parsing.
(As I noted, I tend to think of those web pages as start/end tags - and so is xml.) But, of course, they are not "really" the same thing!

Even to this day, when I view a web page "as source"? I cut + paste that into an xml viewer (Visual Studio) - and that process has always worked well.
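
Taking the start/end tag idea one step further, the same event model could also collect the cells of each row. This is a rough sketch under stated assumptions, not tested code from this thread: the "TableSax" event prefix, Rows, and CurrentRow are hypothetical names, and the sketch assumes the Text passed to EndElement holds the characters read for that element. How text inside nested tags (such as links within a cell) is delivered depends on the parser, so such cells may come through incomplete.

B4X:

' Assumed globals (hypothetical names), declared alongside MySax:
'   Private TableLevel As Int
'   Private Rows As List          ' call Rows.Initialize before parsing
'   Private CurrentRow As List
' Start parsing with: MySax.Parse(j.GetInputStream, "TableSax")

Sub TableSax_StartElement (Uri As String, Name As String, Attrs As Attributes)
    If Name = "table" Then TableLevel = TableLevel + 1
    If Name = "tr" And TableLevel = 1 Then
        ' Begin a fresh row for each <tr> of a top-level table
        Dim row As List
        row.Initialize
        CurrentRow = row
    End If
End Sub

Sub TableSax_EndElement (Uri As String, Name As String, Text As StringBuilder)
    Select Name
        Case "td", "th"
            If TableLevel = 1 And CurrentRow.IsInitialized Then CurrentRow.Add(Text.ToString.Trim)
        Case "tr"
            If TableLevel = 1 And CurrentRow.IsInitialized Then Rows.Add(CurrentRow)
        Case "table"
            ' Pair every increment in StartElement with a decrement here
            TableLevel = TableLevel - 1
    End Select
End Sub

Note that this sketch puts the rows of every top-level table into the same Rows list; picking one table out of several would need an extra counter or a check on the table's attributes.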

Now of course - chalk this up to me learning - not YET having landed on a better road and approach.

(But, sax was impressive! - and how cool a concept to use an event-driven model for parsing? - that can keep memory requirements rather low. What a neat idea!)

So, I'll still keep what I learned about sax here in my future tool box. (and hey, we HAVE this choice in B4A!!!).

Thankfully, this post has given me bricks of gold here - this is not only a great reflection of B4A, but of this incredible forum.

No doubt, this is a job for an HTML parser.

Again, much thanks to you, and the great forums here - it's appreciated.

Regards,
Albert D. Kallal
Edmonton, Alberta Canada
