Android Tutorial Parse HTML code

netchicken · Jan 1, 2012

Wow, I have been thinking over the last few days about how to get started on a program that extracts information from a webpage. I was totally at a loss for a starting point, and now you post this!!

Thanks so much, I look forward to working with it over the next week or so.

Gary

NeoTechni · Jan 1, 2012

You're very welcome.

giammy · Jan 26, 2012

Problem with parsing HTML

Hi, I tried the HttpUtils example. After the HTML page is downloaded and I have obtained the string, the string is never updated: it keeps the value from the first download. How can I update it?

code:

Sub Globals
    Dim b4a As String
    b4a = "http://www.b4x.com"
End Sub

Sub Activity_Create (FirstTime As Boolean)
    HttpUtils.CallbackActivity = "Main" 'Current activity name.
    HttpUtils.CallbackJobDoneSub = "JobDone"
    HttpUtils.Download("Job1", b4a)
End Sub

Sub JobDone (Job As String)
    Dim s As String
    If HttpUtils.IsSuccess(b4a) Then
        s = HttpUtils.GetString(b4a)
    End If
End Sub

NeoTechni · Jan 27, 2012

After "s = HttpUtils.GetString(b4a)" you didn't actually do anything with the HTML.
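
The local variable s disappears as soon as JobDone returns, so the HTML has to be stored or processed inside that sub. A minimal sketch, assuming the same HttpUtils calls as in the code above (PageHtml is a hypothetical global, not part of HttpUtils):

```b4x
Sub Globals
    Dim b4a As String
    b4a = "http://www.b4x.com"
    Dim PageHtml As String 'hypothetical: holds the most recently downloaded HTML
End Sub

Sub JobDone (Job As String)
    If HttpUtils.IsSuccess(b4a) Then
        'Keep the result in a global so it survives after this sub returns
        PageHtml = HttpUtils.GetString(b4a)
        Log("Downloaded " & PageHtml.Length & " characters") 'or parse PageHtml here
    End If
End Sub
```

To refresh the value later, the assumption is that you call HttpUtils.Download again and let JobDone overwrite PageHtml.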

peacemaker · Mar 5, 2012

How can I clean the HTML? I mean, extract just the plain text from it.

So it needs to delete all the tags and keep only the text.
Please suggest modifications.

NeoTechni · Mar 5, 2012

Just remove the 2 select case structures for handling tags

Select Case Name.ToLowerCase

and

Select Case Name.ToLowerCase.Replace("/", "")
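
If you only need the plain text and not the parser itself, a cruder alternative is to strip the tags with a regular expression. This is a different technique from editing the module, shown only as a rough sketch; StripTags is a hypothetical helper name.

```b4x
'Hypothetical helper, not part of the posted module.
Sub StripTags(HTML As String) As String
    'Replace every <...> run with a space so neighbouring words do not merge
    Dim Text As String = Regex.Replace("<[^>]*>", HTML, " ")
    'Collapse the runs of whitespace left behind
    Text = Regex.Replace("\s+", Text, " ")
    Return Text.Trim
End Sub
```

For example, StripTags("<p>Hello <b>world</b></p>") gives "Hello world". The contents of script and style blocks and entities such as &amp;amp; are left in the text, so this only suits simple pages.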

klaus · Mar 5, 2012

Be careful:
in B4A it's Select variable and not Select Case variable like in VB.

Best regards.

NeoTechni · Mar 5, 2012

klaus said:
Be careful:
in B4A it's Select variable and not Select Case variable like in VB.

Best regards.

Incorrect. I've been using select case and it works.

Erel · Mar 5, 2012

You can use Select Case or Select. Both should work.
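
A small sketch of the two spellings side by side (DescribeTag is a hypothetical sub used only for illustration):

```b4x
'Hypothetical sub: maps a tag name to a short description.
Sub DescribeTag(Name As String) As String
    Select Name.ToLowerCase 'could equally be written: Select Case Name.ToLowerCase
        Case "a"
            Return "link"
        Case "b", "strong"
            Return "bold"
        Case Else
            Return "other"
    End Select
End Sub
```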

walterf25 · Jan 23, 2013

HTML Parsing

Hi NeoTechni, I actually need some help parsing an HTML file. I'm working on an app where you can download movies, music, ebooks, etc. straight onto your phone, but I'm having some problems parsing the content.

The URL is http://thepiratebay.se/top/all

Can you maybe help out with this? I saw your module, but I can't seem to follow it well enough to modify it for my specific needs.

Thanks, and please let me know if you can help me out with this!

cheers,
Walter

NeoTechni · Jan 25, 2013

Sure I can help. You're just trying to extract a list of URLs?

B4X:

Sub GetTag(HTML As String, Tag As String) As String 
   Return GetBetween(HTML, " " &  Tag  & "=" & GPlus.vbQuote, GPlus.vbQuote)
End Sub

Sub EnumAHREFs(HTMLCode As String)As List 
   Dim temp As Int,temp2 As Int, htag As String  ,Name As String ,temp3 As String, Node As String
   Dim tempstr As String ,HREFS As List  ', tempstr As StringBuilder
   HREFS.Initialize 
   Do Until temp >= HTMLCode.Length OR temp<0 'OR tempstr.Length > MaxStringBuilderLength
      tempstr=""
      'Log(temp & "/" & HTMLCode.Length)
      If Mid(HTMLCode, temp,1) = "<" Then
         temp2=HTMLCode.IndexOf2(">", temp+1)
         htag=Mid(HTMLCode, temp,temp2-temp+1)
         temp=temp2+1
         Name=GetTagName(htag)
         Select Case Name.ToLowerCase 
            Case "a"', "script", "title", "h1", "h2", "h3", "header","footer","style"
               'If Not( name.EqualsIgnoreCase("a") AND htag.Contains(" name=") ) Then
                  temp3 = HTMLCode.IndexOf2("</" & Name, temp2)
                  Node=Mid(HTMLCode, temp2+1,temp3-temp2-1).Replace("&quot;", "'").Trim 
                  temp2=HTMLCode.IndexOf2(">", temp3+1)
                  temp=temp2+1
                  'Log("NODE2:" & Node)
               'End If
         End Select
         
         Select Case Name.ToLowerCase.Replace("/", "")
            Case "a"':      tempstr= tempstr & MakeLCARbutton(lcar.LCAR_Orange, node)
               'Log("HTML: " & htag)
               'Log("TAG: " & GetTagName(htag))
               HREFS.Add( htag )
         End Select
         
      Else
         temp2=HTMLCode.IndexOf2("<", temp+1)
         If temp2>-1 Then
            htag=Mid(HTMLCode, temp,temp2-temp).Trim
         Else
            temp2=HTMLCode.Length 
            htag=Right(HTMLCode, temp2-temp).Trim 
         End If
         temp=temp2
      End If
   Loop

   Return HREFS
End Sub
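
A rough usage sketch for the two subs above. It assumes GetTagName from the module in the first post is available, that GetBetween is in the project, and that GPlus.vbQuote resolves to a double-quote character. LogLinks is a hypothetical name.

```b4x
'Hypothetical helper: logs the href of every <a> tag found in HTMLCode.
Sub LogLinks(HTMLCode As String)
    Dim Links As List = EnumAHREFs(HTMLCode)
    Dim i As Int
    For i = 0 To Links.Size - 1
        Dim OpenTag As String = Links.Get(i) 'the whole opening tag, e.g. <a href="...">
        Log(GetTag(OpenTag, "href"))         'just the value of its href attribute
    Next
End Sub
```

EnumAHREFs returns the complete opening tags, so GetTag is what pulls the URL out of each one.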

warwound · Jan 25, 2013

walterf25 said:
Hi NeoTechni, I actually need some help parsing an HTML file. I'm working on an app where you can download movies, music, ebooks, etc. straight onto your phone, but I'm having some problems parsing the content.

The URL is http://thepiratebay.se/top/all

Can you maybe help out with this? I saw your module, but I can't seem to follow it well enough to modify it for my specific needs.

Thanks, and please let me know if you can help me out with this!

cheers,
Walter

Hi walterf25.

Did you follow up on my post to your thread here: http://www.b4x.com/forum/basic4andr.../25274-parsing-html-page-help.html#post146819?

I reckon a server-side proxy script would be far better and would cope with badly written HTML. I can help with the PHP if required.

Martin.

jalle007 · Mar 4, 2013

Hi Neo

Very good library, and I think it's just what I was looking for.
I wonder if you can help me with this. I have this simple page:
SIA.ba - mobile
where I need to extract the table (there is just one on the page) and the data in its rows.

NeoTechni · Mar 5, 2013

Turns out I forgot to post the GetBetween API, but that would do most of it for you. I'll post it once I get home.

You just need to find the start and end of the table.
I replaced the HTML start/end brackets with { } so they'd show:

Start: {table style="width:100%; font-size:11px"}
End: {/table}

GetBetween would get everything between those, which is the data you need.
Then replace the tab character with nothing, to get rid of garbage data.

Then you can Regex.Split the text on {/tr}, which separates it by row into an array that you can loop through, again using GetBetween on
{strong} and {/strong}
and {span} and {/span}

Any time there's nothing between {span} and {/span}, treat it as a label rather than a value.
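
Put together, the steps above could look roughly like this. It is only a sketch: it assumes a GetBetween(Text, Start, Finish) sub that returns an empty string when nothing is found, it assumes the table's opening tag matches the Start string exactly, and LogTableRows is a hypothetical name. The { } placeholders are written as real < > brackets here.

```b4x
'Hypothetical helper: logs the label/value pairs of the page's single table.
Sub LogTableRows(HTML As String)
    Dim TableStart As String = "<table style=" & QUOTE & "width:100%; font-size:11px" & QUOTE & ">"
    Dim TableHtml As String = GetBetween(HTML, TableStart, "</table>")
    TableHtml = TableHtml.Replace(TAB, "") 'drop the tab characters
    Dim Rows() As String = Regex.Split("</tr>", TableHtml) 'one array element per row
    Dim i As Int
    For i = 0 To Rows.Length - 1
        Dim RowLabel As String = GetBetween(Rows(i), "<strong>", "</strong>")
        Dim RowValue As String = GetBetween(Rows(i), "<span>", "</span>")
        If RowValue.Length = 0 Then
            Log("Label: " & RowLabel) 'empty span: treat the row as a label
        Else
            Log(RowLabel & " = " & RowValue)
        End If
    Next
End Sub
```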

NeoTechni · Mar 8, 2013

B4X:

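Sub GetBetween(Text As String, Start As String, Finish As String) As String 
   Dim temp As Int,temp2 As Int
   temp=Text.IndexOf(Start)
   If temp>-1 Then
      'Search for Finish right after Start so an empty match (e.g. <span></span>) is found too
      temp2=Text.IndexOf2(Finish, temp+Start.Length)
      If temp2>-1 Then Return Mid(Text, temp+Start.Length,temp2-temp-Start.Length)
   End If
   Return "" 'Start or Finish not found, or nothing between them
End Sub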

Sub Left(Text As String, Length As Long)As String 
   If Text.Length>0 AND Length>0 Then
      'If Length>Text.Length Then Length=Text.Length 
      Return Text.SubString2(0, Min(Text.Length,Length))
   End If
   Return ""
End Sub

Sub Right(Text As String, Length As Long) As String
   If Text.Length>0 AND Length>0 Then
      'If Length>Text.Length Then Length=Text.Length 
      Return Text.SubString(Text.Length-Min(Text.Length,Length))
   End If
   Return ""
End Sub
Sub Mid(Text As String, Start As Int, Length As Int) As String 
   If Length>0 AND Start>-1 AND Start< Text.Length Then Return Text.SubString2(Start,Start+Length)
End Sub
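
A few quick calls to show how these helpers behave (TestStringHelpers is a hypothetical sub added only for illustration). Note that Mid here is zero-based, unlike VB's Mid, because it is built on SubString2.

```b4x
'Hypothetical test sub exercising the helpers posted above.
Sub TestStringHelpers
    Log(Left("abcdef", 2))                        'ab
    Log(Right("abcdef", 2))                       'ef
    Log(Mid("abcdef", 1, 3))                      'bcd (Start counts from 0)
    Log(GetBetween("<b>hi</b>", "<b>", "</b>"))   'hi
End Sub
```

Mid does not clamp its Length argument, so asking for more characters than remain after Start will raise an error from SubString2.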

walterf25 · Jun 13, 2013

Help parsing HTML

Hi Neo, I'm back trying to update my app. I need an easy way to parse a very badly formatted HTML file. I know you posted an example for me, but I have not been able to figure out how to make it work. I'm at it again, but I'm stuck at this function:

B4X:

Sub GetTag(HTML As String, Tag As String) As String 
    Return GetBetween(HTML, " " &  Tag  & "=" & GPlus.vbQuote, GPlus.vbQuote)
End Sub

What exactly is GPlus.vbQuote? Is this another library I'm missing?

Can you point me in the right direction?

Thanks,
Walter

NeoTechni · Jun 13, 2013

Ah, my bad.

B4X:

dim VBquote as string = """"

I doubt my code works well with badly formatted HTML. I think I was lazy.

walterf25 · Jun 13, 2013

Help parsing HTML

Thanks, but what is the variable GPlus?

NeoTechni · Jun 13, 2013

GPlus is my library for parsing Google Plus, Twitter and now Facebook. I put VBquote in that library and didn't fix the reference when I posted the code here.
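
So a version of GetTag that does not need GPlus at all might look like the sketch below. It swaps GPlus.vbQuote for B4A's built-in QUOTE constant and assumes GetBetween from earlier in the thread is in the same module.

```b4x
'GetTag without the GPlus reference: QUOTE is the built-in double-quote constant.
Sub GetTag(HTML As String, Tag As String) As String
    Return GetBetween(HTML, " " & Tag & "=" & QUOTE, QUOTE)
End Sub
```

For example, GetTag("<a href=""http://www.b4x.com"" class=""x"">", "href") returns http://www.b4x.com. Attributes written with single quotes or with no quotes are not matched.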
