My JavaScript method above returns only the object and not its content, so to get the additional data from the HTML I tried the suggestion of converting the HTML to XML with the parser in the jTidy library and then parsing that XML with the jXmlSax parser.
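' Save the downloaded page (from job j) to temp.html
Dim out As OutputStream = File.OpenOutput(File.DirApp, "temp.html", False)
File.Copy2(j.GetInputStream, out)
out.Close
' Convert temp.html to temp.xml with jTidy
tid.Initialize
tid.Parse(File.OpenInput(File.DirApp, "temp.html"), File.DirApp, "temp.xml")
' Parse the converted file with jXmlSax; events are raised with the "sax" prefix
sax.Initialize
sax.Parse(File.OpenInput(File.DirApp, "temp.xml"), "sax")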
The jTidy parser logs some warnings (see the attached log file), but the jXmlSax parser throws fatal errors on the resulting XML file. The HTML and converted XML files are attached.
The first fatal error is:
[Fatal Error] :120:30: The content of elements must consist of well-formed character data or markup.
Line 120: if(jQuery(window).width() < 727) {
If I delete the script containing that line from the XML file and re-run the parser, I get another fatal error:
[Fatal Error] :118:3: The element type "link" must be terminated by the matching end-tag "</link>".
Line 117: <!--[if lt IE 9]>
Line 118: <link rel="stylesheet" type="text/css" href="/matches/css/v2_ie.css">
Line 119: <![endif]-->
So it seems jTidy cannot turn this source HTML into XML that the jXmlSax parser will accept.
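Both kinds of error can be reproduced outside B4J with a few lines of plain Java. This is only a sketch, assuming jXmlSax delegates to the JDK's built-in SAX parser (the wording of the messages suggests that). The class name `WellFormedCheck` and the method `firstFatalError` are made up for illustration, and the XML snippets are shortened stand-ins for lines 120 and 118, not the real file.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of jTidy or jXmlSax.
class WellFormedCheck {

    // Returns null if the text parses as XML, otherwise the parser's error message.
    static String firstFatalError(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            // DefaultHandler rethrows fatal errors, so they end up here.
            return e.getMessage();
        } catch (Exception e) {
            return e.toString();
        }
    }

    public static void main(String[] args) {
        // A bare '<' inside script text, as on line 120.
        String rawScript = "<html><script>if (w < 727) { }</script></html>";
        // The same script wrapped in a CDATA section.
        String cdataScript = "<html><script><![CDATA[if (w < 727) { }]]></script></html>";
        // An HTML-style void element with no end tag, as on line 118.
        String openLink = "<head><link rel=\"stylesheet\" href=\"v2_ie.css\"></head>";
        // The same element self-closed.
        String closedLink = "<head><link rel=\"stylesheet\" href=\"v2_ie.css\"/></head>";

        for (String xml : new String[] {rawScript, cdataScript, openLink, closedLink}) {
            String error = firstFatalError(xml);
            System.out.println((error == null ? "OK" : "ERROR: " + error) + "  <-  " + xml);
        }
    }
}
```

The CDATA and self-closed variants are only there to show what a well-formed version of each line looks like. Whether jTidy can be configured to emit that form for this page is a separate question.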
Does anyone have a working example of extracting HTML element values by parsing XML generated from the HTML?
A typical source URL for me is:
http://webcast-a.live.sportingpulseinternational.com/matches/9/11/04/11/30DQrvCMPyDmc/index.html