Parsing HTML page help

walterf25

Expert
Licensed User
Longtime User
Hello everyone, I need some help. I'm working on a new app but I'm stuck: I don't really understand the Regex functions. The HTML information I need to extract is, for example, the file size, Uploaded, By, Seeders, Leechers and Quality.

Attached is a text file with the HTML information, and here is a portion of what I need to extract.

<dt>Size:</dt>
<dd>704.4&nbsp;MiB&nbsp;(738613989&nbsp;Bytes)</dd>
<br />

<dt>Info:</dt>
<dd><a href="http://www.imdb.com/title/tt1931533/" target="_blank" title="IMDB" rel="nofollow">IMDB</a></dd> <dt>Spoken language(s):</dt>
<dd>English</dd>


<dt>Tag(s):</dt>
<dd><a href="/tag/Seven">Seven</a> <a href="/tag/Psychopaths">Psychopaths</a> <a href="/tag/2012">2012</a> <a href="/tag/DVDSCR">DVDSCR</a> <a href="/tag/XviD">XviD</a> <a href="/tag/AbSurdiTy">AbSurdiTy</a> </dd>
<dt>Quality:</dt>
<dd id="rating" class="">
+9 / -1 (+8) </dd>

<br />

<dt>Uploaded:</dt>
<dd>2013-01-07 15:18:04 GMT</dd>

<dt>By:</dt>
<dd>
<a href="/user/scene4all/" title="Browse scene4all">scene4all</a>&nbsp;<img src="//static.thepiratebay.se/img/vip.gif" alt="VIP" title="VIP" style="width:11px;" border='0' /></dd>
<br />

<dt>Seeders:</dt>
<dd>19546</dd>

<dt>Leechers:</dt>
<dd>2974</dd>

I can extract the labels by using this pattern:
<dt>.*</dt>

[1] => <dt>Files:</dt>
[2] => <dt>Size:</dt>
[3] => <dt>Info:</dt>
[4] => <dt>Spoken language(s):</dt>
[5] => <dt>Tag(s):</dt>
[6] => <dt>Quality:</dt>
[7] => <dt>Uploaded:</dt>
[8] => <dt>By:</dt>
[9] => <dt>Seeders:</dt>
[10] => <dt>Leechers:</dt>
[11] => <dt>Comments</dt>
[12] => <dt>Info Hash:</dt>

But I also need to get the information that comes after each of those tags.

Can anyone please help me with this, or does anyone have any examples I can use?


View attachment HTML_Code.txt

Thanks Everyone!
Cheers,
Walter
 

warwound

You could parse the HTML as XML using the B4A XmlSax library or my XOM library.
As long as your HTML is well formed/valid, it'd be far easier than any regular expression approach, I think.
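
For comparison, here is a minimal sketch of the regular-expression route the question asks about. The pattern `<dt>.*</dt>` only returns labels because `.` does not cross line breaks and nothing captures the following `<dd>`. The sketch is written in Python for illustration, and the names `extract_pairs` and `clean` are hypothetical. It assumes every label is plain text and is followed directly by its `<dd>`. A similar pattern should be usable with B4A's Regex.Matcher, though that is untested here.

```python
import html
import re

# A label (<dt>) followed by its value (<dd>, possibly with attributes).
# DOTALL lets the value span several lines; [^<]* keeps the label from
# running past its own closing tag.
PAIR = re.compile(r"<dt>([^<]*)</dt>\s*<dd[^>]*>(.*?)</dd>",
                  re.DOTALL | re.IGNORECASE)
TAG = re.compile(r"<[^>]+>")


def clean(fragment):
    """Drop nested tags, decode entities such as &nbsp;, collapse whitespace."""
    text = html.unescape(TAG.sub(" ", fragment))
    return " ".join(text.split())


def extract_pairs(page):
    """Return a dict mapping each label (without the colon) to its text."""
    return {clean(label).rstrip(":"): clean(value)
            for label, value in PAIR.findall(page)}
```

Called on the fragment from the first post, `extract_pairs(page)["Seeders"]` gives `"19546"`. Regular expressions remain fragile on real pages, so treat this as a starting point only.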

Martin.
 

walterf25

XOM Library

Hi WarWound, thanks for the suggestion. I actually started playing with your library but I can't seem to get it to work with my data, although the example you provided works just fine.
I'm a little confused :D The data I receive from the website I'm working with is just HTML. When I save the data to a file named, let's say, "2.xml" and then use that file instead of the "weather.xml" you provided in your example, it doesn't trigger the following event:
B4X:
Sub XOMBuilder1_BuildDone(XOMDocument1 As XOMDocument, Tag As Object)
That tells me the document doesn't get built. Again, I'm sorry for my ignorance, but I don't seem to understand the whole concept. The data I get from the website is in HTML format, so is this library meant to convert this type of data into XML format?

If so, how can I do this? What I have tried so far isn't working for me.

thanks,
Walter
 

warwound

Hi.

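HTML is HyperText Markup Language. Strictly speaking it is not a subset of XML, but XHTML is an XML format, and any HTML that happens to be well formed can be parsed as XML.
So well-formed HTML can be treated as XML and parsed accordingly.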

First thing to check is that your HTML is well formed/valid.
You have two easy options:
  • Rename the HTML file to something.xml and open it in a desktop browser. The browser will either render the XML as formatted text or tell you it contains an error (is not well formed), and it should tell you which line the error occurs on.
  • Validate the HTML in an online XML validator such as XML Validation.
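
The same check can also be run locally with any strict XML parser. Below is a minimal sketch in Python using the standard library's `xml.etree.ElementTree`. The function name `check_well_formed` is hypothetical, and the sketch only tests well-formedness, not validity against a DTD.

```python
import xml.etree.ElementTree as ET


def check_well_formed(markup):
    """Return None if the markup parses as XML, else the parser's message.

    The message from the underlying parser includes the line and column
    of the first error it found.
    """
    try:
        ET.fromstring(markup)
    except ET.ParseError as err:
        return str(err)
    return None
```

Note that a strict parser stops at the first problem, so a page may need several rounds of fixing. Entities such as `&nbsp;` are also rejected unless the document declares them, which matters for the fragment in the first post.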

If your HTML is not valid then there is no way that the XOM library will be able to parse it - XOM is very strict and must have valid XML.
I'm not sure how the B4A XmlSax library handles XML that is not valid - it might or might not work.

If the HTML is not valid and it comes from a 3rd party source (you have no control over its creation so cannot fix bugs) then there is little you can do on the device to fix it.
You might be able to load it into a WebView (as HTML, not XML) and let the WebView render it. The WebView will do its best to work around invalid HTML and render as much of the page as possible.
Then, if you injected some JavaScript into the WebView to grab the rendered HTML, you'd have a fixed version.
That's untested - the WebView might return the original HTML or it might return the HTML with any errors corrected, I'm not sure which.
Either way it's a clumsy way to fix invalid HTML.

An idea - you could have a PHP script hosted online which acts as a proxy/repairer script.
You send the URL of the HTML page to this script, and the script fetches the HTML, does its best to fix errors and then returns the fixed HTML to your device.
This is doable for an HTML page that contains minor errors, but if the HTML is very badly written then it's unlikely the PHP script would be able to fix it.

Can you test your original HTML - see if it validates in the online validator and see if a desktop browser can open it if you rename its file extension to .xml?

Post your results - I can help with the XOM parsing code if you have valid HTML, otherwise upload the HTML so I can see how badly formed it is.

Martin.
 

walterf25

XOM Library

Hi WarWound, I've tried the XML validator at the link you posted. Unfortunately my HTML file does not validate, even if I rename it to file.xml.

Here is the HTML file. Maybe you can look at it and see how well or how badly it is formatted.

Thanks for your help WarWound, I really appreciate it!

View attachment movies.zip

Cheers,
Walter
 

warwound

The HTML is invalid.
I renamed movies.html to movies.xml and dragged and dropped the XML file onto Firefox.
Firefox reports:

XML Parsing Error: mismatched tag. Expected: </img>.
Line Number 220, Column 164:
<a href="http://cdn1.adexprt.com/lp/2.php?name=Silver.Linings.Playbook.2012.DVDRIP-EDAW2013"><img src="http://static.thepiratebay.org/img/bar.gif" border="0"></a>

This is an XHTML webpage and the img tag should be self-closed, something like:

B4X:
<a href="http://cdn1.adexprt.com/lp/2.php?name=Silver.Linings.Playbook.2012.DVDRIP-EDAW2013"><img src="http://static.thepiratebay.org/img/bar.gif" border="0" /></a>
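
For simple cases, this particular class of error (void elements such as img or br left unclosed) can be patched automatically. The following is a rough Python sketch, not a general HTML repair tool. The function name `self_close_void_tags` and the list of tag names are assumptions for illustration, and it will misfire on attribute values that contain `>`.

```python
import re

# Void elements that XHTML requires to be written as <tag ... />.
VOID_TAG = re.compile(r"<(img|br|hr|input|meta|link)\b([^<>]*?)\s*/?>",
                      re.IGNORECASE)


def self_close_void_tags(markup):
    """Rewrite <img ...> as <img ... />, leaving already closed tags as they are."""
    return VOID_TAG.sub(r"<\1\2 />", markup)
```

Applied to the line Firefox complained about, this produces the self-closed form shown above. It does nothing for other problems such as unbalanced tags or undeclared entities.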

Manually fixing that error and loading the xml into Firefox now shows another error later on in the file.

So you have a typical poorly written HTML document to parse...

Have a look at this page: Parsing of badly formated HTML in PHP - Stack Overflow
If you have some webspace with PHP support then you could create a PHP script which your application calls and passes the URL of the required webpage to.
The PHP script would fetch the webpage, do its best to make the webpage HTML valid/well-formed, and then return the webpage to your application.

You could probably optimise things - if your application just wants some of the data from the webpage HTML then the PHP script could (hopefully!) fix the original HTML, extract the data you require and return just the data that your application requires.
The script could return that data as JSON, XML or plain text - whatever is easiest for your application to parse.
This solution obviously reduces the amount of parsing that your application has to do on the device - offloading most of the parsing to the server running the PHP script.
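
The extract-and-return-JSON idea can be sketched as follows. Note the substitutions: this uses Python rather than PHP, and instead of repairing the HTML it reads it with the standard library's lenient `html.parser`, which does not require well-formed input. The class and function names are hypothetical, and fetching the page (the cURL step) is left out, so the function takes the HTML as a string.

```python
import json
from html.parser import HTMLParser


class DefinitionListParser(HTMLParser):
    """Collect the text of each <dt> label and the <dd> that follows it."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.pairs = {}
        self._label = None    # last <dt> text seen, waiting for its <dd>
        self._current = None  # "dt", "dd" or None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._flush()  # tolerate a missing closing tag
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._flush()

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)

    def _flush(self):
        # Collapse whitespace, including non-breaking spaces from &nbsp;.
        text = " ".join("".join(self._buffer).split())
        if self._current == "dt":
            self._label = text.rstrip(":")
        elif self._current == "dd" and self._label is not None:
            self.pairs[self._label] = text
            self._label = None
        self._current = None
        self._buffer = []


def details_as_json(page):
    """Return the label/value pairs found in the page as a JSON object string."""
    parser = DefinitionListParser()
    parser.feed(page)
    parser.close()
    return json.dumps(parser.pairs)
```

The device then only has to parse a small JSON object rather than the whole page.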

Have you got access to a webserver that has support for both PHP and cURL?
(cURL would be used by the script to request the webpage from the server).

Martin.
 

walterf25

Hi Martin, sorry for not replying to this sooner, I got busy with other projects. I need to update one of my apps and I was wondering if you would be willing to help me with this. I have no experience with PHP or scripts. What I need is to parse an HTML page and retrieve certain information to display in my application. If you are available, let me know, and let me know how much you would charge to help me with this.

Thanks,
Walter
 

Mark Read

Well-Known Member
Licensed User
Longtime User
Hello Walter,

I have not had time to complete this project but it should get you well started. I took your movies.html and parsed it twice, once to reformat it and a second time to get the info. It takes about 7 seconds on my slow emulator. Sorry that it is not perfect but I have to work as well. Hope it helps you.
Best regards from Austria
Mark
 

Attachments

  • HTMLParser.zip
    12.8 KB · Views: 448