B4J Question miniHtmlParser help

walterf25 · Jan 11, 2024

Hi all, I have never used this library, but I would like to try to parse the page at the following URL here. What I am trying to do is parse opera librettos. If you check the URL, you'll notice that the original-language text and the translation are both placed in a table. I am having a hard time wrapping my head around how the MiniHtmlParser library works. I have seen some examples in the forum, but I am not an expert in HTML either.

So basically what I need is to just get the English version on the left column and the Original version on the right column, can someone shed some light on how I would be able to manage this?

Thanks everyone.
Walter

William Lancee · Jan 11, 2024

Interesting project. See attached code and tested zip.

B4X:

Sub Process_Globals
    Private HtmlParser As MiniHtmlParser
End Sub

Sub AppStart (Args() As String)
    HtmlParser.Initialize
    ' Parse a saved copy of the libretto page
    Dim root As HtmlNode = HtmlParser.Parse(File.ReadString(File.DirAssets, "opera.html"))
    ' The libretto table sits inside <div id="content"> and has width="100%"
    Dim div As HtmlNode = HtmlParser.FindNode(root, "div", HtmlParser.CreateHtmlAttribute("id", "content"))
    Dim table As HtmlNode = HtmlParser.FindNode(div, "table", HtmlParser.CreateHtmlAttribute("width", "100%"))
    ' The second child of the table is taken to be its <tbody>
    Dim tbody As HtmlNode = table.Children.Get(1)
    For Each nodex As HtmlNode In tbody.Children
        If nodex.Name = "tr" Then
            Dim data As List = nodex.Children
            For Each tdata As HtmlNode In data
                If tdata.Name = "td" Then
                    ' Walk the cell's children from index 1 and log the text nodes
                    For i = 1 To tdata.Children.Size - 1
                        Dim chNode As HtmlNode = tdata.Children.Get(i)
                        If chNode.Name = "text" Then
                            Log(HtmlParser.GetTextFromNode(tdata, i))
                        End If
                    Next
                    ' Separator between cells
                    Log("_________________")
                End If
            Next
        End If
    Next
End Sub

walterf25 · Jan 11, 2024

William Lancee said:

Interesting project. See attached code and tested zip.

That actually works very well. However, I am trying to parse the content directly by using the OkHttpUtils2 library.

B4X:

    Dim job As HttpJob
    job.Initialize("", Me)
    job.Download("https://www.murashev.com/opera/La_traviata_libretto_English_Italian")
    Wait For (job) JobDone(job As HttpJob)
    If job.Success Then
        'Log(job.GetString)
        Dim s As String = job.GetString
        ' Save the downloaded page to the application folder
        File.WriteString(File.DirApp, "opera.html", s)
    End If
    job.Release

It doesn't seem to work when done this way. Also, is there a way to extract the columns side by side? What I mean is that I would like to extract each English row together with its Italian row, rather than extracting all the rows of the English translation at once and then all the Italian rows at once.

Walter

William Lancee · Jan 11, 2024

The web site is inconsistent with the use of <tbody> in a <table>. I test for that now and bypass the issue.
The pairing of items is some simple string and list manipulation. See attached .zip.
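
For readers without the attachment, the two ideas above might be sketched roughly as follows. This is not the code from the .zip. It assumes `HtmlParser` is the initialized MiniHtmlParser from Process_Globals, that the table is still found through the same `div` and `table` lookups as in the first example, and that a translation row has exactly two `<td>` cells. The sub names `LogRowsSideBySide` and `CellText` are made up for illustration.

```b4x
' Sketch only, not the code from the attached zip.
' Logs each table row as "left cell | right cell".
Sub LogRowsSideBySide (html As String)
    Dim root As HtmlNode = HtmlParser.Parse(html)
    Dim div As HtmlNode = HtmlParser.FindNode(root, "div", HtmlParser.CreateHtmlAttribute("id", "content"))
    Dim table As HtmlNode = HtmlParser.FindNode(div, "table", HtmlParser.CreateHtmlAttribute("width", "100%"))
    ' Use <tbody> as the row container when the page has one, otherwise the <table> itself
    Dim rowsParent As HtmlNode = table
    For Each child As HtmlNode In table.Children
        If child.Name = "tbody" Then rowsParent = child
    Next
    For Each row As HtmlNode In rowsParent.Children
        If row.Name <> "tr" Then Continue
        Dim cells As List
        cells.Initialize
        For Each cell As HtmlNode In row.Children
            If cell.Name = "td" Then cells.Add(CellText(cell))
        Next
        ' Assumption: only rows with exactly two cells form a translation pair
        If cells.Size = 2 Then Log(cells.Get(0) & " | " & cells.Get(1))
    Next
End Sub

' Joins the direct text children of one cell into a single string
Private Sub CellText (td As HtmlNode) As String
    Dim sb As StringBuilder
    sb.Initialize
    For i = 0 To td.Children.Size - 1
        Dim ch As HtmlNode = td.Children.Get(i)
        If ch.Name = "text" Then sb.Append(HtmlParser.GetTextFromNode(td, i)).Append(" ")
    Next
    Return sb.ToString.Trim
End Sub
```

Because the sub takes the HTML as a string, it could be called with `job.GetString` after a successful download as well as with the contents of a saved file.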

walterf25 · Jan 12, 2024

Wow, this works very well. There are some minor glitches, but I think I can deal with those. For example, in the fifth paragraph of the libretto, for some reason the actor name is not included in the parsed text.

Only the original text and translated text are included.

But thanks again so much for your help, it would have taken me ages to figure this out.

Walter

William Lancee · Jan 12, 2024

You are right. The italics are not handled properly.
See revised solution attached.
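
One likely reason italic text goes missing is that text wrapped in `<i>` is a child of the `<i>` node rather than a direct child of the `<td>`, so a loop over the cell's direct children never reaches it. A minimal sketch of a recursive collector is shown below. It is not the revised code from the attachment. It assumes `HtmlParser` is the initialized MiniHtmlParser from Process_Globals, and `CollectText` is a hypothetical name.

```b4x
' Sketch only, not the code from the attached zip.
' Appends the text found under node, at any depth, to sb in document order.
Private Sub CollectText (node As HtmlNode, sb As StringBuilder)
    If node.Children.IsInitialized = False Then Return
    For i = 0 To node.Children.Size - 1
        Dim ch As HtmlNode = node.Children.Get(i)
        If ch.Name = "text" Then
            sb.Append(HtmlParser.GetTextFromNode(node, i)).Append(" ")
        Else
            ' An element such as <i>: descend to reach the text it wraps
            CollectText(ch, sb)
        End If
    Next
End Sub

' Possible use inside the loop over <td> cells:
'   Dim sb As StringBuilder
'   sb.Initialize
'   CollectText(tdata, sb)
'   Log(sb.ToString.Trim)
```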

mcqueccu · Jan 12, 2024

Wow @William Lancee Thanks for the help

walterf25 · Jan 15, 2024

William Lancee said:
You are right. The italics are not handled properly.
See revised solution attached.

I still see the same problem, on the same paragraph it doesn't parse the TITLE but it should be OK for now.

Thanks again for your help.

William Lancee · Jan 15, 2024

The titles are in non-standard HTML <act></act> tags, but parsing them is possible once you know that. See attached .zip.
Since you probably don't have control over how the website composes its results, you should be prepared for the possibility that my code won't work in the future.

However, if you are motivated enough, you can see how I parsed the HTML and then follow my example.
A lot of it was informed trial and error.
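
One way to support that kind of trial and error is to print the names of the parsed nodes as an indented tree, so that unexpected tags such as `act` become visible. The helper below is only a sketch under that assumption. `DumpTree` is a made-up name, and it relies only on the `Name` and `Children` fields of `HtmlNode` that the code earlier in this thread already uses.

```b4x
' Sketch only: logs the name of node and of every descendant, indented by depth.
Private Sub DumpTree (node As HtmlNode, indent As String)
    Log(indent & node.Name)
    If node.Children.IsInitialized = False Then Return
    For Each child As HtmlNode In node.Children
        DumpTree(child, indent & "  ")
    Next
End Sub

' Possible use, once the table node has been located:
'   DumpTree(table, "")
```

If the site changes its markup later, running the dump again on the new page should show where the structure differs.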

Good luck with your efforts.
