B4J Question Extract values from website

bdunkleysmith · May 19, 2014

The attached project is a testbed I'm using for a project where I want to extract some values from a website. I have a fully working version in Visual Basic 2010 and a working testbed in B4A, but I just can't get the extraction working in B4J.

The code:

B4X:

    js.evalString("var x = doc.getElementById('aj_1_score');")
    js.evalString("var y = x.innerHTML;")
    js.evalString("var t = doc.title;")
    Msg.Show(js.engineGet("x"),"engineget_x")
    Msg.Show(js.engineGet("y"),"engineget_y")
    navBar.Text = js.engineGet("y")
    Msg.Show(js.engineGet("t"),"engineget_t")

returns:

[object HTMLSpanElement]

sun.org.mozilla.javascript.internal.NativeJavaObject@3346e3

sun.org.mozilla.javascript.internal.NativeJavaObject@1ef1245

I saw in another thread that the return of something like sun.org.mozilla.javascript.internal.NativeJavaObject@3346e3 was due to the page not being fully rendered, but this shouldn't be the problem given it's included in a PageFinished sub.

The code which works in B4A is:

B4X:

Sub WebView1_PageFinished (Url As String)
    '    Now that the web page has loaded we can get the data we want
  
    '    see the documentation http://www.b4x.com/forum/additional-libraries-classes-official-updates/12453-webviewextras.html#post70053 for details of the second parameter callUIThread
  
    Dim Javascript As String
    For count = 1 To 2
        TeamScore = "aj_"&count&"_score"
        Javascript="B4A.CallSub('ProcessHTML', true, document.getElementById('"&TeamScore&"').innerHTML, '"&TeamScore&"')"

        'Log("PageFinished: "&Javascript)
        WebViewExtras1.executeJavascript(WebView1, Javascript)
  
    Next
  
End Sub

Sub ProcessHTML(Html As String, Team As String)
    '    This is the Sub that processes the extracted data
  
    'Log(Team & " = " & Html)
    Msgbox(Html,Team)
  
End Sub

Any assistance appreciated.

Daestrum · May 19, 2014

I just tried this with my new JScriptEngine for JavaFX 8 (Nashorn, a rewritten version of my original Script Engine library).

This is what I get with the new engine (I just took a few lines and the startpage from your example):

Program started.
[object HTMLSpanElement] - engineget_x
78 - engineget_y
LiveStats v4 - engineget_t

I know the layout looks bad; I used log in place of msg.

This is with the old script engine:

Program started.
[object HTMLSpanElement] - engineget_x
78 - engineget_y
LiveStats v4 - engineget_t

I would guess that msg is not converting the type the way log does.

Try:
Dim score As String
js.enginePut("score", score)
js.evalString("score = x.innerHTML;")

Then use score in msg. I don't use msg, so I can't test it.

bdunkleysmith · May 19, 2014

Thanks for your assistance, but I used log rather than msg and still got:

Program started.
[object HTMLSpanElement]
sun.org.mozilla.javascript.internal.NativeJavaObject@1c50f6a
sun.org.mozilla.javascript.internal.NativeJavaObject@147e2b6

Perhaps it is some property of the webview which results in the unexpected returns. I thought webview would be platform independent, but perhaps it's inheriting something from my native OS. For instance my VB2010 app won't work for IE8, but is OK for later releases.

Daestrum · May 19, 2014

If you are using Java 8, I have uploaded the jNashorn library to the tail end of the jScriptEngine library thread.
Use it just like jScriptEngine: Dim js As JNashorn (Nashorn - N-a-s-h-o-r-n looks like ..om on my monitor)

bdunkleysmith · May 19, 2014

Unfortunately both my XP and Win7 machines have only Version 7 Update 55, and it says that's the latest for those platforms, so I don't think I can take advantage of your new library. Is it still worth loading the new library?

I was hoping to 'port' my VB2010 app to B4J so it would be platform independent, but I may have to settle for a B4A implementation so it can run on Android rather than Windows.

Did your "few lines" of code use the same method as my code to load the start page? If your simple approach worked then the secret may be in that.

Daestrum · May 19, 2014

this was in the wv1_PageFinished(...) routine
just before the end sub line

B4X:

    navBar.Text = startPage
    js.evalString("var x = doc.getElementById('aj_1_score');")
    js.evalString("var y = x.innerHTML;")
    js.evalString("var t = doc.title;")
    Log(js.engineGet("x")&" - engineget_x")
    Log(js.engineGet("y")&" - engineget_y")
    navBar.Text = js.engineGet("y")
    Log(js.engineGet("t")&" - engineget_t")

and I changed my startPage to yours and loaded it the normal way:

B4X:

    temp = wv1
    we = temp.RunMethod("getEngine",Null)
    wv1.Enabled = True
    wv1.Visible = True
    we.RunMethod("load",Array As Object(startPage))

I just ran your version (apart from a message about a later version, and I deleted the manifest file); it ran fine once I changed msg to log.

bdunkleysmith · May 21, 2014

It seems I can't resolve any objects within the document except links, because if I put

B4X:

    js.evalString("var z = doc.links.item(0);")
    Log(js.engineget("z") & " = engineget_z")

then http://webcast-a.live.sportingpulse...10/94/26/33QMmC9UFUvjM//index.html?1399723101 = engineget_z is shown in the log.
Any suggestions on where to look for the reason why the code returns values as expected for you but not in my environment?

Daestrum · May 21, 2014

Did you try my suggestion from post #2?

B4X:

Dim s As String
js.enginePut("score", s)
js.evalString("score = doc.getElementById('aj_1_score');")
Log(js.engineGet("score"))

bdunkleysmith · May 22, 2014

Yes I tried your suggestion Daestrum, but still getting sun.org.mozilla.javascript.internal.NativeJavaObject@?????

I've tried all sorts of things in the code and can obviously find various elements in the page, but can't resolve the content to be able to use them.

I'm a novice in this area, but I wonder if it's something to do with inherited properties of webview, so that I need to do something like load the WebViewSettings library described here http://www.b4x.com/android/forum/threads/webviewsettings.12929/ and perhaps use

setDOMStorageEnabled (webView1 As WebView, Enabled As Boolean)
Set whether the DOM storage API is enabled.

Erel · May 22, 2014

Why don't you download the page with jHttpUtils2?

bdunkleysmith · May 22, 2014

OK Erel. So this is what I did:

B4X:

Sub ScoreButton_MouseClicked (EventData As MouseEvent)
    Dim job As HttpJob
    job.Initialize("GetData", Me)
    job.Download(startPage)             
End Sub
Sub JobDone(Job As HttpJob)
    If Job.Success Then
        Log(Job.GetString)
        Dim page As String = job.GetString     
        js.engineput("page",page) 
        js.evalString("var x = page.getElementById('aj_1_score');")
        js.evalString("var y = x.innerHTML;")
        js.evalString("var t = page.title;")
        Log(js.engineGet("x") & " = engineget_x")
        Log(js.engineGet("y") & " = engineget_y")
        Log(js.engineget("t") & " = engineget_t")
    Else
        Log("Job Error: " & Job.ErrorMessage)
    End If
    Job.Release
End Sub

But while the log shows the downloaded content is as expected, the contents of aj_1_score and title show the same as the original method. Below is the log, but I've removed a substantial amount of the page content for simplicity.

Should I be using a different method to scrape the contents of the specified elements when using the HttpJob.Download method?

bdunkleysmith · May 22, 2014

In fact this method is not finding the specified elements. I realised that I'd reused the x, y, z and t variables, so the original values were being shown. When I changed x to xx, y to yy, z to zz and t to tt in the code above, the logged output was:

sun.org.mozilla.javascript.internal.Undefined@a462b7 = engineget_xx
sun.org.mozilla.javascript.internal.Undefined@a462b7 = engineget_yy
sun.org.mozilla.javascript.internal.Undefined@a462b7 = engineget_zz
sun.org.mozilla.javascript.internal.Undefined@a462b7 = engineget_tt

I'm obviously approaching this the wrong way.

Erel · May 22, 2014

You should upload the html response as a text file. It is difficult to see it this way.

I would have chosen a different way to parse it. You can convert the HTML to XML with the Tidy library and then use an XML parser to parse it.

bdunkleysmith · May 23, 2014

Thanks Erel.

I've attached the B4A testbed project which successfully extracts data from the target website. What I've found interesting about this website is that while fields like the scores I've used as an example display correctly in the browser, the actual value doesn't appear in the attached HTML (source) file.

For instance the source shows <td class="period-cell"><span id="aj_1_p1_score"></span></td> but document.getElementById('aj_1_score').innerHTML returns 78 as the value correctly in my B4A project.

Therefore I'm not sure taking the HTML, converting it to XML and parsing it will give the desired result.
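To illustrate the concern, here is a minimal JavaScript sketch that pulls the span text out of the static source line quoted above. The helper name spanText is made up for this illustration, and the regular expression is only a rough stand-in for a real HTML parser. It suggests that any parser working on the saved source alone would find an empty string, because the page's scripts fill the value in after loading.

```javascript
// Static source line quoted above: the span exists but has no content yet.
const staticHtml = '<td class="period-cell"><span id="aj_1_p1_score"></span></td>';

// Hypothetical helper (not part of any library): return the text between the
// opening and closing tags of the span with the given id, or null if the
// static source contains no span with that id.
function spanText(html, id) {
  const pattern = new RegExp('<span[^>]*\\bid="' + id + '"[^>]*>([^<]*)</span>');
  const match = pattern.exec(html);
  return match ? match[1] : null;
}

// The span is present but empty in the static source.
console.log(JSON.stringify(spanText(staticHtml, 'aj_1_p1_score'))); // prints ""

// An id that does not occur in this fragment gives null.
console.log(spanText(staticHtml, 'aj_1_score')); // prints null
```

Running the same lookup inside the live page, as the B4A project does, sees the DOM after the scripts have run, which would explain the different result.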

But the question still remains as to why my B4J code returns the score values when Daestrum runs it as per post #2 and it returns sun.org.mozilla.javascript.internal.NativeJavaObject@?????? for me.

Erel · May 23, 2014

bdunkleysmith said: "But the question still remains as to why my B4J code returns the score values when Daestrum runs it as per post #2 and it returns sun.org.mozilla.javascript.internal.NativeJavaObject@?????? for me."

Sorry but I'm not familiar with jScriptEngine library.

The data is fetched with additional calls. Use Firebug to see it. Seems like you can get all the data from this JSON string:
http://webcast-a.live.sportingpulseinternational.com/matches/9/10/94/26/33QMmC9UFUvjM//data.json

This is much simpler than parsing the html or injecting JavaScript.
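As a rough sketch of how one might explore such a JSON string once it is downloaded, the JavaScript below lists every leaf value with its path. The sample payload and its field names (teams, name, score) are purely hypothetical, since the real layout of data.json is not shown in this thread, and listLeaves is a made-up helper name.

```javascript
// Hypothetical payload: the real data.json layout is not shown in this thread.
const sample = '{"teams":[{"name":"A","score":78},{"name":"B","score":0}]}';

// Hypothetical helper: recursively collect a "path = value" line for every
// leaf (non-object, non-array value) in the parsed JSON.
function listLeaves(node, path, out) {
  if (node !== null && typeof node === 'object') {
    // Object.keys works for arrays too and yields the indices "0", "1", ...
    for (const key of Object.keys(node)) {
      listLeaves(node[key], path ? path + '.' + key : key, out);
    }
  } else {
    out.push(path + ' = ' + node);
  }
  return out;
}

const leaves = listLeaves(JSON.parse(sample), '', []);
leaves.forEach(line => console.log(line));
// prints:
// teams.0.name = A
// teams.0.score = 78
// teams.1.name = B
// teams.1.score = 0
```

Once the path to the wanted value is known, the same lookup can be done in B4J with a JSON parser instead of injecting JavaScript into the page.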

bdunkleysmith · May 23, 2014

I can't thank you enough and using your JsonTree tool all is revealed. Absolutely brilliant!

Thanks again Erel

bdunkleysmith · Jun 26, 2014

Erel,

I've been able to do almost all I want based on the data available from data.json, but some data, e.g. player number, inexplicably isn't in it. So I'm chasing more. I know playerdata.json gives me some more information, but I'm wondering how you used Firebug to see that the web page fetched data with additional calls.

I've set Firebug as an add-on in Firefox, but I've not seen how to use it to display additional calls made by the page. Can you please give me a quick pointer in the right direction?

Thanks,

Bryon

Erel · Jun 26, 2014

Check the Net tab when you refresh the page.

bdunkleysmith · Jun 26, 2014

Thank you very much. I can now see what's happening. It looks like I'll need to use a technique similar to what I did originally to get the player number field as that's only in the page HTML and not contained in the json data.

bdunkleysmith · Jun 27, 2014

To get the additional data from the HTML, given that my JavaScript method above returns only the object and not its content, I tried the suggestion of converting the HTML to XML with the jTidy library and parsing the result with the jXmlSax parser.

B4X:

Dim out As OutputStream = File.OpenOutput(File.DirApp, "temp.html", False)
File.Copy2(j.GetInputStream, out)
out.Close

tid.Initialize
tid.Parse(File.OpenInput(File.DirApp, "temp.html"), File.DirApp, "temp.xml")
sax.Initialize
sax.Parse(File.OpenInput(File.DirApp, "temp.xml"), "sax")

The jTidy parser logs some warnings as per the log file attached. However the jXmlSax parser throws fatal errors from the XML file. The HTML and converted XML files are attached.

The first fatal error is:

[Fatal Error] :120:30: The content of elements must consist of well-formed character data or markup.

Line 120 if(jQuery(window).width() < 727) {

If I delete the script containing that from the XML file and re-run it I get another fatal error:

[Fatal Error] :118:3: The element type "link" must be terminated by the matching end-tag "</link>".

Line 117 

So it seems the source HTML is not suitable for parsing with jTidy as an input to the jXmlSax parser.
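The manual step of deleting the offending script could be automated along these lines. This is only a JavaScript sketch of the idea, not something tested against jTidy or jXmlSax: stripScripts is a made-up helper, the sample page string is hypothetical apart from the jQuery line quoted above, and it addresses only the first error (the raw < inside a script), not the unterminated link element.

```javascript
// Hypothetical helper: remove every <script>...</script> element so that raw
// "<" characters inside JavaScript cannot make the converted XML ill-formed.
function stripScripts(html) {
  return html.replace(/<script\b[^>]*>[\s\S]*?<\/script>/gi, '');
}

// Hypothetical page fragment; only the jQuery line comes from the real page.
const page = '<div><script>if(jQuery(window).width() < 727) { }</script>' +
             '<span id="n">7</span></div>';

console.log(stripScripts(page)); // prints <div><span id="n">7</span></div>
```

A regular expression like this is a blunt tool and could misfire on unusual markup, so it is best treated as a pre-cleaning step before the HTML is handed to the converter.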

Does anyone have a working example of extracting HTML element values by parsing an XML created by parsing HTML?

A typical source URL for me is http://webcast-a.live.sportingpulseinternational.com/matches/9/11/04/11/30DQrvCMPyDmc/index.html
