HTML parsing issue on some phones

Inman

Well-Known Member
Licensed User
Longtime User
In one of my apps, I download an HTML file using HttpClient and then parse the HTML. This works fine on most phones, including mine, but certain users are reporting parse errors. When I looked at the HTML on their devices, I realised what had happened.

This is how the HTML code is supposed to look:

B4X:
<select name=nDesign id="vote-rate-design" title="Design"><option>1</option><option>2</option><option>3</option><option>4</option><option selected>5</option><option>6</option><option>7</option><option>8</option><option>9</option><option>10</option></select> 
<select name=nFeatures id="vote-rate-features" title="Features"><option>1</option><option>2</option><option>3</option><option>4</option><option selected>5</option><option>6</option><option>7</option><option>8</option><option>9</option><option>10</option></select>

But this is how some phones return it:

B4X:
<select name=nDesign id="vote-rate-design" title="Design"><option>1</option><option>2</option><option>3</option><option>4</option><option selected>5</option><option>6</option><option>7</option><option>8</option><option>9</option><option>10</option></select> <select name=nFeatures id="vote-rate-features" title="Features"><option>1</option><option>2</option><option>3</option><option>4</option><option selected>5</option><option>6</option><option>7</option><option>8</option><option>9</option><option>10</option></select>

Basically, on those phones the whole HTML code is on a single line without line breaks. On my phone it contains line breaks, and I can split the lines using Regex.Split(CRLF,html).

I download the file using HttpClient and read the contents in the following way:

B4X:
TextReader1.Initialize(File.OpenInput(File.DirInternalCache, "page.html"))
result=TextReader1.ReadAll

Since that returns the string without any line breaks, I also tried this:

B4X:
result=File.ReadString(File.DirInternalCache, "page.html")

But the result is still the same. On certain phones the entire HTML source is on a single line without line breaks. How can I convert it into the normal form, so that I can split the lines using Regex.Split(CRLF,html)?
 

Inman

Thanks margret, that would work in this case. But there is more to this HTML code; I only posted a couple of lines from it. I am hoping we can find a generic solution that handles the line breaks correctly.

I was wondering if this has something to do with character encoding.
 

Inman

In the case of <select> tags, yes, and there are many of them. But there are also other tags like <tr>, <td>, <a>, etc. Unfortunately this can change, as the pages are dynamic, which is why we need a generic solution.
 

warwound

Expert
Licensed User
Longtime User
I think your problem lies elsewhere.
HTML requires no line breaks or formatting to be valid.

Try validating the original web page with the W3C Markup Validation Service. Does it validate?

If so, now try to validate your parsed version of the page that has no line breaks. It should still be valid.

You could write an entire webpage as a single line and it will validate if the HTML contains no errors.
(And HTML that lacks nice formatting or line breaks is NOT an error.)

Can you try to load the version of the webpage (with no line breaks - the version that does not render correctly) into a desktop browser and look for other reasons why it fails to render?
If you load it into Chrome or newer versions of IE, you can use the built-in browser debugger to find what's going wrong.

Martin.
 

Inman

Yes, the page validates. My issue is that I parse this web page line by line, so it is crucial that I can split the source code into single lines.

But what I don't understand is why the source code arrives as nicely formatted HTML, line breaks and all, on the majority of phones, including mine, while on some phones all the HTML is on a single line. It is not Android-version specific, because the users who reported it are on Android 2.3.x as well as 4.x. It is not manufacturer specific either.

I am thinking of two possibilities:

1) For some reason HttpClient is downloading the web page without line breaks.
2) The downloaded file does contain line breaks, but the way the app reads the file (either with TextReader or File.ReadString) results in the line break characters being omitted.

I am trying to get one of the users to mail me the downloaded HTML file so that we can confirm. I am hoping No. 2 is the reason. Again it is a mystery why it is happening only on certain phones. If the problem is with character encoding of the downloaded file, could it be fixed with TextReader.Initialize2(InputStream,Encoding)?
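One way to tell the two possibilities apart is to count the line terminators in the raw downloaded text before any line-based parsing happens. The following is only a sketch in JavaScript (the same language as the snippet later in this thread), not B4A code; the function name `countLineBreaks` and the sample string are made up for illustration.

```javascript
// Count each kind of line terminator in a piece of downloaded text.
function countLineBreaks(text) {
  const crlf = (text.match(/\r\n/g) || []).length;      // Windows-style pairs
  const allLf = (text.match(/\n/g) || []).length;       // every LF, paired or not
  const allCr = (text.match(/\r/g) || []).length;       // every CR, paired or not
  const lf = allLf - crlf;                              // LF not part of a CRLF pair
  const cr = allCr - crlf;                              // CR not part of a CRLF pair
  return { crlf: crlf, lf: lf, cr: cr, total: crlf + lf + cr };
}

// Hypothetical sample containing one terminator of each kind.
const sample = '<select>\r\n<option>1</option>\n<option>2</option>\r</select>';
console.log(countLineBreaks(sample)); // { crlf: 1, lf: 1, cr: 1, total: 3 }
```

If the count is already close to zero for the bytes as they were saved to disk, the line breaks were lost before the file was read, which would point at the download path rather than at TextReader or File.ReadString.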
 

citywest

Member
Licensed User
Longtime User
Inman, as warwound points out, linefeeds, carriage returns, etc. are not required by HTML parsers and are not part of the spec.

One solution is to zip them up at the server end, download to the client, unzip them, process the html string in whatever way you require and then pass them on to the browser.

Mark S.
 

Inman

I am downloading these pages (and then parsing them) from a public website, which is why I don't have control over the server end.
 

citywest

Inman,

OK so you have no control over the server end, nor anything in between.

Another option is to add back the "missing" detail. The following JavaScript, which I've used from time to time, inserts a CRLF before each tag listed in its arrays.

It shouldn't take much to convert to B4A, although you may need to remove some tags from the arrays to achieve your desired results.

Cheers,

Mark S.

B4X:
function buildReadableHTML(HtmlString) {
   var readableHTML = HtmlString;
   var crlf = '\r\n';
   var i; // declared once so the loops do not create a global

   // Document-level tags: one line break before each match.
   var headerTags = [ "<html","</html>","</head>","<title","</title>","<meta","<link","<style","</style>","</body>" ];
   for (i = 0; i < headerTags.length; ++i) {
       var header = headerTags[i];
       // Was RegExp(headerTags, ...), which built the pattern from the whole array.
       readableHTML = readableHTML.replace(new RegExp(header, 'gi'), crlf + header);
   }

   // Body tags: one line break before each match.
   var bodyTags = [ "</div>","</span>","</form>","</fieldset>","<br>","<br />","<hr","<pre","</pre>","<blockquote","</blockquote>","<ul","</ul>","<ol","</ol>","<li","<dl","</dl>","<dt","</dt>","<dd","</dd>","<!--","<table","</table>","<caption","</caption>","<th","</th>","<tr","</tr>","<td","</td>","<script","</script>","<noscript","</noscript>" ];
   for (i = 0; i < bodyTags.length; ++i) {
       var body = bodyTags[i];
       readableHTML = readableHTML.replace(new RegExp(body, 'gi'), crlf + body);
   }

   // Form tags: one line break before each match (duplicate "<option" entry removed).
   var formTags = [ "<label","</label>","<legend","</legend>","<object","</object>","<embed","</embed>","<select","</select>","<option","<input","<textarea","</textarea>" ];
   for (i = 0; i < formTags.length; ++i) {
       var form = formTags[i]; // was ftags[i], an undefined variable
       readableHTML = readableHTML.replace(new RegExp(form, 'gi'), crlf + form);
   }

   // Block-opening tags: a blank line plus a line break before each match.
   // Note that prefixes such as "<p" also match longer names like "<pre".
   var xtraTags = ["<body","<head","<div","<span","<p","<form","<fieldset"];
   for (i = 0; i < xtraTags.length; ++i) {
       var extra = xtraTags[i];
       readableHTML = readableHTML.replace(new RegExp(extra, 'gi'), crlf + crlf + extra);
   }
   return readableHTML;
}
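Because the arrays above hold tag prefixes, an entry such as "<p" also matches "<pre" and "<li" also matches "<link". A possible variant, sketched here under the assumption that only opening tags need a break and that the tag names are plain alphanumeric strings, matches whole tag names in a single pass. The name `breakBeforeTags` and its parameters are hypothetical and not part of the function above.

```javascript
// Insert a CRLF before every opening tag whose full name is in tagNames.
function breakBeforeTags(html, tagNames) {
  // "<name" counts only when the name is followed by whitespace, "/" or ">",
  // so "p" does not match "<pre" and "li" does not match "<link".
  const pattern = new RegExp('<(?:' + tagNames.join('|') + ')(?=[\\s/>])', 'gi');
  // Re-emit the matched text itself so the original letter case is preserved.
  return html.replace(pattern, function (match) {
    return '\r\n' + match;
  });
}

const flat = '<p>Intro</p><pre>code</pre><select name=a><option>1</option></select>';
const lines = breakBeforeTags(flat, ['p', 'select', 'option']).split('\r\n');
console.log(lines);
// [ '', '<p>Intro</p><pre>code</pre>', '<select name=a>', '<option>1</option></select>' ]
```

Closing tags could be handled the same way by allowing an optional "/" after the "<" in the pattern, at the cost of more lines per element.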
 

Inman

Thanks Mark. Looks like I will have to use your logic.

I got the downloaded HTML file from one of my users who had this issue, and then downloaded the same file on my phone. His file contains 47 lines, while my file contains 597 lines.

So one thing is confirmed: there is nothing wrong with TextReader or any other file reading mechanism. The problem lies entirely in how the file is downloaded through HttpClient. On certain phones the file arrives with most of its line breaks missing, while on most phones it arrives with all of them.

I wonder if some parameter or setting could be enabled in HttpClient or HttpRequest to make sure it downloads the file with all line break characters.
 

bluejay

Active Member
Licensed User
Longtime User
Note that some mobile networks use Web Accelerators (also known as PEP - Performance Enhancing Proxy).

So anything that is not part of the HTML spec could get stripped out.

Server web pages can also get optimised on the way to the mobile, e.g. by reordering pages to display text before graphics, or by re-encoding large graphics for small screens.

So what is sent by the server could be different to what the mobile receives.

bluejay
 

Inman

If that is the case, I think it is better to implement something similar to Mark's function.
 

warwound

Ahhh, I just re-read the thread...

The problem is not that the WebView won't render HTML that contains no line breaks, but that your HTML parsing code expects the web page to contain line breaks!

A solution might be to parse the web page HTML using my XOM library.
(Or the B4A SaxParser).

As long as the HTML is well formed it should work and you'll have a more robust solution.

Martin.
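To illustrate why tag-based parsing is more robust than line-based parsing, here is a rough JavaScript sketch that walks the markup tag by tag and reads the selected option of each <select>. It does not use the XOM library or SaxParser, and it is not a real HTML parser: the function names `tokenize` and `selectedOptions` are invented for this example, the sample markup is a shortened imitation of the first post, and the simple tag pattern would break on comments, scripts, or attribute values containing ">".

```javascript
// Split markup into tags and text runs; line breaks play no role here.
function tokenize(html) {
  return html.match(/<[^>]+>|[^<]+/g) || [];
}

// Return an object mapping each <select> name to the text of its selected option.
function selectedOptions(html) {
  const result = {};
  let currentSelect = null;  // name of the <select> we are inside, if any
  let captureNext = false;   // true right after an <option selected> tag
  for (const token of tokenize(html)) {
    if (/^<select\b/i.test(token)) {
      const m = token.match(/\bname\s*=\s*["']?([^"'\s>]+)/i);
      currentSelect = m ? m[1] : null;
    } else if (/^<\/select\b/i.test(token)) {
      currentSelect = null;
    } else if (/^<option\b/i.test(token)) {
      captureNext = /\sselected\b/i.test(token);
    } else if (token[0] !== '<' && captureNext && currentSelect !== null) {
      result[currentSelect] = token.trim();
      captureNext = false;
    }
  }
  return result;
}

const oneLine = '<select name=nDesign id="vote-rate-design" title="Design"><option>4</option><option selected>5</option><option>6</option></select> <select name=nFeatures id="vote-rate-features" title="Features"><option>1</option><option selected>2</option></select>';
const multiLine = oneLine.replace('</select> <select', '</select>\r\n<select');

console.log(selectedOptions(oneLine));   // { nDesign: '5', nFeatures: '2' }
console.log(selectedOptions(multiLine)); // { nDesign: '5', nFeatures: '2' }
```

Both inputs give the same result, so it no longer matters whether a proxy or device strips the line breaks on the way.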
 

Inman

Somehow I missed your library, Martin. And I was under the impression that B4A SaxParser works only with XML and nothing else.

I managed to implement a function similar to Mark's. But your library is very interesting to me as almost all my apps involve HTML parsing.
 