Android Question Extracting text source website

lock255 · Mar 14, 2014

Hello everyone, I cannot reproduce a function that I use a lot in VB.NET. It extracts text from a web page's source when the source has no unique reference strings to search for:

B4X:

Dim HTML As String
Dim V1, V2 As Object
Dim V As String
HTML = WebBrowser1.DocumentText.ToString
' Split the page source on every "> sequence
V1 = Split(HTML, Chr(34) & ">")
' Take piece 56 and cut it at the closing tags
V2 = Split(V1(56), "</a></h2>")
V = V2(0)

How can I do the same thing in B4A?

Erel · Mar 16, 2014

Do you want to get the WebView text?

It is easier to download it with HttpUtils2.

DonManfred · Mar 16, 2014

lock255 said:
How can I do the same thing in B4A?

If you really need the WebView (because you want to show the page to the user), then this could be of help.

But if you just need the HTML and don't want to show it to the user, then, as Erel already suggested, HttpUtils2 is the better alternative.

See this example:

B4X:

Sub Activity_Create(FirstTime As Boolean)
    'Do not forget to load the layout file created with the visual designer. For example:
    'Activity.LoadLayout("Layout1")
    Dim php As HttpJob
    php.Initialize("htmltest",Me)
    php.Download("http://www.google.com/")
End Sub

Sub JobDone(Job As HttpJob)
    ProgressDialogHide
    If Job.Success Then
        Dim res As String
        res = Job.GetString
        Log("JobName: "&Job.JobName)
        If Job.JobName = "htmltest" Then
            Log("HTML is: "&res)
        Else If Job.JobName = "Init" Then
            Log("")
        End If
    Else
        ToastMessageShow("Error: " & Job.ErrorMessage, True)
    End If
    Job.Release
End Sub

lock255 · Mar 16, 2014

In fact, I need to retrieve an exact piece of text from the source code of a page, where there are no unique reference strings to search for.
As in my first example:

B4X:

V2 = Split(V1(56), "</a></h2>")

This means that the text I want starts after the 56th occurrence of Chr(34) & ">" in the source and ends at the next </a></h2>.
I hope I was clear, and I apologize for my bad English.

Erel · Mar 16, 2014

There are two parts to this problem. First you need to download the text. The code @DonManfred posted will help you with that.

The second part is to parse the string. You can use Regex.Split if you want to split the string. It is usually better to use the jTidy library to convert the HTML to XML and then use an XML parser to parse it.
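
As a rough illustration of the Regex.Split route, the VB.NET snippet from the first post could be sketched in B4A as below. ExtractAfterNth is a hypothetical helper name, not part of any library. Note that Regex.Split treats its first argument as a regular expression, so this only works as-is because the two delimiters used here contain no regex special characters.

B4X:

'Hypothetical helper: returns the text that follows the Nth occurrence of
'StartDelim and ends at the next EndDelim, or "" if there are too few pieces.
Sub ExtractAfterNth(Html As String, StartDelim As String, N As Int, EndDelim As String) As String
    'Piece 0 is the text before the first delimiter, piece N follows the Nth one.
    Dim parts() As String = Regex.Split(StartDelim, Html)
    If N >= parts.Length Then Return ""
    Dim tail() As String = Regex.Split(EndDelim, parts(N))
    If tail.Length = 0 Then Return ""
    Return tail(0)
End Sub

Inside JobDone it could then be called with the downloaded string, for example: Dim v As String = ExtractAfterNth(res, QUOTE & ">", 56, "</a></h2>"). The index 56 comes from the original question and depends entirely on the page layout, so it will break if the page changes.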

lock255 · Mar 18, 2014

With @DonManfred's advice I got the HTML text. Now I'm using the library you recommended to convert the HTML text to XML.
One thing is not clear to me in the example from the first post of this thread: http://www.b4x.com/android/forum/threads/jtidy-library-convert-html-pages-to-xml.27038/

It seems that I must first save the HTML text to a local file (index.html in your example).
Is it possible to skip the file and pass the HTML text directly from a variable?
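
One possible way to avoid the intermediate HTML file is to wrap the string in an in-memory InputStream. This is only a sketch: it assumes that the jTidy wrapper's Parse method takes an input stream, an output folder and an output file name, as the example in the linked thread suggests. HtmlToXml and "page.xml" are hypothetical names chosen for illustration. The XML result is still written to a file, only the input side stays in memory.

B4X:

'Hypothetical helper: converts an HTML string to an XML file with jTidy.
Sub HtmlToXml(Html As String)
    Dim tidy As Tidy
    tidy.Initialize
    'Turn the string into bytes and expose them as an InputStream.
    Dim data() As Byte = Html.GetBytes("UTF8")
    Dim inStream As InputStream
    inStream.InitializeFromBytesArray(data, 0, data.Length)
    'Assumed signature: Parse(InputStream, OutputDir, OutputFileName).
    tidy.Parse(inStream, File.DirInternal, "page.xml")
End Sub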

Basco · Mar 18, 2014

Thanks for your work, this thread helped me a lot!
