B4J Question miniHtmlParser help

walterf25

Expert
Licensed User
Longtime User
Hi all, I have never used this library, but I would like to use it to parse the page at this URL. What I am trying to do is parse opera librettos. If you check the URL, you'll notice that the original language and the translation are both placed in a table. I am having a hard time wrapping my head around how the MiniHtmlParser library works. I have seen some examples in the forum, but I could not follow them since I am not an expert in HTML either.

So basically, what I need is to get the English version in the left column and the original version in the right column. Can someone shed some light on how I could manage this?

Thanks everyone.
Walter
 

William Lancee

Well-Known Member
Licensed User
Longtime User
Interesting project. See attached code and tested zip.

B4X:
Sub Process_Globals
    Private HtmlParser As MiniHtmlParser
End Sub

Sub AppStart (Args() As String)
    HtmlParser.Initialize
    ' Parse the saved copy of the page from the Files (assets) folder.
    Dim root As HtmlNode = HtmlParser.Parse(File.ReadString(File.DirAssets, "opera.html"))
    ' Locate the content div, then the full-width table inside it.
    Dim div As HtmlNode = HtmlParser.FindNode(root, "div", HtmlParser.CreateHtmlAttribute("id", "content"))
    Dim table As HtmlNode = HtmlParser.FindNode(div, "table", HtmlParser.CreateHtmlAttribute("width", "100%"))
    ' Takes the table's second child as the <tbody>.
    Dim tbody As HtmlNode = table.Children.Get(1)
    For Each nodex As HtmlNode In tbody.Children
        If nodex.Name = "tr" Then
            Dim data As List = nodex.Children
            For Each tdata As HtmlNode In data
                If tdata.Name = "td" Then
                    ' Starts at index 1, so the first child of the cell is skipped.
                    For i = 1 To tdata.Children.Size - 1
                        Dim chNode As HtmlNode = tdata.Children.Get(i)
                        If chNode.Name = "text" Then
                            Log(HtmlParser.GetTextFromNode(tdata, i))
                        End If
                    Next
                    ' Separator logged after each cell.
                    Log("_________________")
                End If
            Next
        End If
    Next
End Sub
 

Attachments

  • operettas.zip
    45.4 KB · Views: 29

walterf25

Expert
Licensed User
Longtime User
That actually works very well. However, I am trying to parse the content directly by using the OkHttpUtils2 library:

B4X:
    Dim job As HttpJob
    job.Initialize("", Me)
    job.Download("https://www.murashev.com/opera/La_traviata_libretto_English_Italian")
    Wait For (job) JobDone(job As HttpJob)
    If job.Success Then
        'Log(job.GetString)
        ' Keep the downloaded page as a string and save a copy next to the app.
        Dim s As String = job.GetString
        File.WriteString(File.DirApp, "opera.html", s)
    End If
    job.Release

It doesn't seem to work when done this way. Also, is there a way to extract the columns side by side? What I mean is that I would like to extract each English row together with its Italian row, rather than extracting all the rows of the English translation at once and then all the Italian rows at once.

Walter
 

William Lancee

Well-Known Member
Licensed User
Longtime User
The web site is inconsistent with the use of <tbody> in a <table>. I test for that now and bypass the issue.
The pairing of items is some simple string and list manipulation. See attached .zip.
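
For readers without the attachment, here is a rough sketch of the general idea. It is not the code from the zip. It assumes that each <tr> holds one English <td> and one original-language <td>, that FindDirectNodes accepts Null as "any attribute", and that HtmlParser is the initialized global from the earlier code. DownloadAndPair, ExtractRowPairs and CellText are hypothetical helper names, and checks for missing nodes are left out.

```b4x
' Sketch only: downloads the page, parses the string directly, logs the rows in pairs.
Sub DownloadAndPair
    Dim job As HttpJob
    job.Initialize("", Me)
    job.Download("https://www.murashev.com/opera/La_traviata_libretto_English_Italian")
    Wait For (job) JobDone(job As HttpJob)
    If job.Success Then
        ' Parse the downloaded string instead of reading a saved file.
        Dim root As HtmlNode = HtmlParser.Parse(job.GetString)
        Dim div As HtmlNode = HtmlParser.FindNode(root, "div", HtmlParser.CreateHtmlAttribute("id", "content"))
        Dim table As HtmlNode = HtmlParser.FindNode(div, "table", HtmlParser.CreateHtmlAttribute("width", "100%"))
        For Each pair() As String In ExtractRowPairs(table)
            Log(pair(0) & " | " & pair(1))
        Next
    End If
    job.Release
End Sub

' Hypothetical helper: returns a List of String arrays, (0) = first column, (1) = second column.
Private Sub ExtractRowPairs (Table As HtmlNode) As List
    Dim pairs As List
    pairs.Initialize
    ' Use <tbody> when the page has one, otherwise read the rows straight from <table>.
    Dim rowsParent As HtmlNode = Table
    Dim bodies As List = HtmlParser.FindDirectNodes(Table, "tbody", Null)
    If bodies.Size > 0 Then rowsParent = bodies.Get(0)
    For Each tr As HtmlNode In HtmlParser.FindDirectNodes(rowsParent, "tr", Null)
        Dim cells As List = HtmlParser.FindDirectNodes(tr, "td", Null)
        If cells.Size >= 2 Then
            pairs.Add(Array As String(CellText(cells.Get(0)), CellText(cells.Get(1))))
        End If
    Next
    Return pairs
End Sub

' Hypothetical helper: joins the direct text children of a cell (nested tags are ignored).
Private Sub CellText (Cell As HtmlNode) As String
    Dim sb As StringBuilder
    sb.Initialize
    For i = 0 To Cell.Children.Size - 1
        Dim child As HtmlNode = Cell.Children.Get(i)
        If child.Name = "text" Then sb.Append(HtmlParser.GetTextFromNode(Cell, i)).Append(" ")
    Next
    Return sb.ToString.Trim
End Sub
```

In a non-UI B4J app, call StartMessageLoop in AppStart after starting DownloadAndPair, otherwise the program ends before JobDone is raised. If the page puts whole columns into a single cell rather than one row per passage, the pairing has to be done on the text instead, which this sketch does not cover.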
 

Attachments

  • operettas2.zip
    9.2 KB · Views: 35

walterf25

Expert
Licensed User
Longtime User
Wow, this works very well. There are some minor glitches, but I think I can deal with those. For example, in the fifth paragraph of the libretto, for some reason the actor name is not included in the parsed text.
libretto.JPG

Only the original text and translated text are included.

But thanks again so much for your help, it would have taken me ages to figure this out.

Walter
 

walterf25

Expert
Licensed User
Longtime User
You are right. The italics are not handled properly.
See revised solution attached.
I still see the same problem: in that same paragraph the title is not parsed, but it should be OK for now.

Thanks again for your help.
 

William Lancee

Well-Known Member
Licensed User
Longtime User
The titles are in non-standard HTML <act></act> tags, but extracting them is possible once you know that. See the attached .zip.
Since you probably don't have control over how the website composes its results, you should be prepared for the possibility that my code won't work in the future.

However, if you are motivated enough you can see how I parsed the html and then follow my example.
A lot of it is informed trial and error.
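
For the trial-and-error part, a small recursive dump of the node tree can help spot unexpected tags such as <act>. This is only a sketch, not the code from the attached zip. DumpNode and CollectText are hypothetical helper names, HtmlParser is the global from the earlier code, and the sketch assumes that every HtmlNode has an initialized Children list and that text nodes are named "text", as in the code earlier in this thread.

```b4x
' Hypothetical helper: logs the tag names as an indented tree.
Private Sub DumpNode (Node As HtmlNode, Depth As Int)
    Dim indent As String = ""
    For i = 1 To Depth
        indent = indent & "  "
    Next
    Log(indent & Node.Name)
    For Each child As HtmlNode In Node.Children
        DumpNode(child, Depth + 1)
    Next
End Sub

' Hypothetical helper: gathers text from a node and from every nested tag
' (italics, <act>, and so on), separated by single spaces.
Private Sub CollectText (Node As HtmlNode, sb As StringBuilder)
    For i = 0 To Node.Children.Size - 1
        Dim child As HtmlNode = Node.Children.Get(i)
        If child.Name = "text" Then
            sb.Append(HtmlParser.GetTextFromNode(Node, i)).Append(" ")
        Else
            ' Not a text node: descend into it, whatever the tag is called.
            CollectText(child, sb)
        End If
    Next
End Sub
```

To try it on a cell, initialize a StringBuilder, call CollectText(tdata, sb) and log sb.ToString.Trim. Calling DumpNode(table, 0) first shows which tags the parser actually produced for the page.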

Good luck with your efforts.
 

Attachments

  • Operettas4.zip
    9.3 KB · Views: 36