HTML parsing issue on some phones

Inman · Dec 18, 2012

In one of my apps, I download an HTML file using HttpClient and then parse the HTML code. This works fine on most phones, including mine, but certain users are reporting parse errors. When I looked at the HTML code from their devices, I realised what had happened.

This is how the HTML code is supposed to be

B4X:

<select name=nDesign id="vote-rate-design" title="Design"><option>1</option><option>2</option><option>3</option><option>4</option><option selected>5</option><option>6</option><option>7</option><option>8</option><option>9</option><option>10</option></select> 
<select name=nFeatures id="vote-rate-features" title="Features"><option>1</option><option>2</option><option>3</option><option>4</option><option selected>5</option><option>6</option><option>7</option><option>8</option><option>9</option><option>10</option></select>

But this is how some phones are returning it

B4X:

<select name=nDesign id="vote-rate-design" title="Design"><option>1</option><option>2</option><option>3</option><option>4</option><option selected>5</option><option>6</option><option>7</option><option>8</option><option>9</option><option>10</option></select> <select name=nFeatures id="vote-rate-features" title="Features"><option>1</option><option>2</option><option>3</option><option>4</option><option selected>5</option><option>6</option><option>7</option><option>8</option><option>9</option><option>10</option></select>

Basically on those phones, the whole HTML code is in a single line without line breaks. But on my phone it contains line breaks and I can split those lines using Regex.Split(CRLF,html).

I download the file using HttpClient and read the contents in the following way

B4X:

TextReader1.Initialize(File.OpenInput(File.DirInternalCache, "page.html"))
result=TextReader1.ReadAll

Since that returns this string without any line breaks, I also tried this

B4X:

result=File.ReadString(File.DirInternalCache, "page.html")

But the result is still the same. On certain phones the entire HTML source code is in a single line without line break. How can I convert it into normal form, from which I can split the lines using Regex.Split(CRLF,html)?

margret · Dec 18, 2012

You can try:

B4X:

' "</select>" is 9 characters long, so the split point is IndexOf + 9.
Part1=htmlstr.SubString2(0, htmlstr.IndexOf("</select>")+9)
Part2=htmlstr.SubString2(htmlstr.IndexOf("</select>")+9, htmlstr.Length)

or

B4X:

' "</select>" is 9 characters long, so the split point is IndexOf + 9.
New = htmlstr.SubString2(0, htmlstr.IndexOf("</select>")+9) & CRLF
New = New & htmlstr.SubString2(htmlstr.IndexOf("</select>")+9, htmlstr.Length)

Inman · Dec 18, 2012

Thanks margret, that would work in this case. But there is more to this HTML code. I only posted a couple of lines from it. I am hoping we can find a generic solution that would parse the line break correctly.

I was wondering if this has something to do with character encoding.

margret · Dec 18, 2012

What about this:

B4X:

New = htmlstr.Replace("</select> ", "</select>" & CRLF)

Inman · Dec 18, 2012

In the case of <select> tags, yes, and there are many of those. But there are also other tags like <tr>, <td>, <a> etc. Unfortunately these can change, as the pages are dynamic, which is why we need a generic solution.

margret · Dec 18, 2012

OK, so if the code that is coming in is missing the LF and there are no standard identifiers in the string, I don't know what you should do.

About the only other thing I can think of might be to check for the beginning of:

"<select "

and if it's not the first one, add a CRLF to the beginning.
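
A minimal sketch of that idea, written in JavaScript rather than B4A so it is easy to try outside the app. The function name `breakBeforeSelect` is made up for illustration, and it only handles the single `<select ` marker discussed here:

```javascript
// Sketch: put every "<select " after the first one on its own line.
function breakBeforeSelect(html) {
  var marker = "<select ";
  var first = html.indexOf(marker);
  if (first === -1) return html;               // no <select> at all
  var head = html.substring(0, first + marker.length);
  var tail = html.substring(first + marker.length);
  // split/join replaces every remaining occurrence of the marker
  return head + tail.split(marker).join("\n" + marker);
}
```

The result can then be split on the line feed character as before.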

warwound · Dec 18, 2012

I think your problem lies elsewhere.
HTML requires no line breaks or formatting to be valid.

Try validating the original webpage with the W3C Markup Validation Service. Does it validate?

If so now try to validate your parsed version of the page that has no line breaks - it should still be valid.

You could write an entire webpage as a single line and it will validate if the HTML contains no errors.
(And lack of nicely formatted HTML with no line breaks is NOT an error).

Can you try to load the version of the webpage (with no line breaks - the version that does not render correctly) into a desktop browser and look for other reasons why it fails to render?
If you load it into Chrome or newer versions of IE then you can use the built in browser debugger to find what's going wrong.

Martin.

Inman · Dec 18, 2012

Yes the page is getting validated. My issue is that I parse this web page line by line. So it is crucial that I should be able to split this source code into single lines.

But what I don't understand is why the source code comes in as beautifully formatted HTML, with line breaks and all, on the majority of phones, including mine, while on some phones all the HTML code is in a single line. It is not specific to an Android version, because the users who reported it are on Android 2.3.x as well as 4.x. It is not manufacturer specific either.

I am thinking of 2 possibilities

1) For some reason the HttpClient is downloading the web page without line breaks
2) The downloaded file does contain line breaks but the way in which the file is read by the app (either with TextReader or File.ReadString) results in the omission of line break character

I am trying to get one of the users to mail me the downloaded HTML file so that we can confirm. I am hoping No. 2 is the reason. Again it is a mystery why it is happening only on certain phones. If the problem is with character encoding of the downloaded file, could it be fixed with TextReader.Initialize2(InputStream,Encoding)?
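
One way to tell the two possibilities apart is to count the line break characters in the raw text of the downloaded file. The sketch below is JavaScript and the helper name `countLineBreaks` is hypothetical. It assumes the file contents are already available as a string:

```javascript
// Sketch: count each kind of line break in a string.
function countLineBreaks(text) {
  var crlf = (text.match(/\r\n/g) || []).length;        // Windows style pairs
  var cr = (text.match(/\r/g) || []).length - crlf;     // bare carriage returns
  var lf = (text.match(/\n/g) || []).length - crlf;     // bare line feeds
  return { crlf: crlf, cr: cr, lf: lf };
}
```

If all three counts are close to zero for a file from an affected phone, the breaks were already missing when the file was saved, which points at the download rather than at the reading code.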

citywest · Dec 18, 2012

Inman, as warwound points out, line feeds, carriage returns etc. are not required by HTML parsers and are not part of the spec.

One solution is to zip them up at the server end, download to the client, unzip them, process the html string in whatever way you require and then pass them on to the browser.

Mark S.

mc73 · Dec 18, 2012

Are you creating these pages, or do you just parse them?

Inman · Dec 19, 2012

I am downloading these pages from a public website and then parsing them, which is why I don't have control over the server's end.

citywest · Dec 19, 2012

Inman,

OK so you have no control over the server end, nor anything in between.

Another option is to add back the "missing" detail. The following JavaScript, which I've used from time to time, will add a CRLF before each tag in the lists.

It shouldn't take much to convert to B4A, although you may need to remove some tags from the arrays to achieve your desired results.

Cheers,

Mark S.

B4X:

function buildReadableHTML(HtmlString) {
   var readableHTML = HtmlString;
   var crlf = '\r\n';
   var i;   // declared locally so the loops do not leak a global counter

   // Head-level tags: one line break before each
   var headerTags = [ "<html","</html>","</head>","<title","</title>","<meta","<link","<style","</style>","</body>" ];
   for ( i = 0; i < headerTags.length; ++i ) {
       var header = headerTags[i];
       // build the pattern from the current tag, not from the whole array
       readableHTML = readableHTML.replace(new RegExp(header, 'gi'), crlf + header);
   }

   // Body-level tags: one line break before each
   var bodyTags = [ "</div>","</span>","</form>","</fieldset>","<br>","<br />","<hr","<pre","</pre>","<blockquote","</blockquote>","<ul","</ul>","<ol","</ol>","<li","<dl","</dl>","<dt","</dt>","<dd","</dd>","<!--","<table","</table>","<caption","</caption>","<th","</th>","<tr","</tr>","<td","</td>","<script","</script>","<noscript","</noscript>" ];
   for ( i = 0; i < bodyTags.length; ++i ) {
       var body = bodyTags[i];
       readableHTML = readableHTML.replace(new RegExp(body, 'gi'), crlf + body);
   }

   // Form-related tags ("<option" is listed once to avoid doubled line breaks)
   var formTags = [ "<label","</label>","<legend","</legend>","<object","</object>","<embed","</embed>","<select","</select>","<option","<input","<textarea","</textarea>" ];
   for ( i = 0; i < formTags.length; ++i ) {
       var form = formTags[i];
       readableHTML = readableHTML.replace(new RegExp(form, 'gi'), crlf + form);
   }

   // Block-opening tags: a blank line before each
   var xtraTags = [ "<body","<head","<div","<span","<p","<form","<fieldset" ];
   for ( i = 0; i < xtraTags.length; ++i ) {
       var extra = xtraTags[i];
       readableHTML = readableHTML.replace(new RegExp(extra, 'gi'), crlf + crlf + extra);
   }
   return readableHTML;
}
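
A more compact variant of the same technique, as a sketch only: it takes the tag names as a parameter, builds one pattern for all of them, and returns the pieces as an array of lines. The name `splitHtmlByTags` and its parameters are hypothetical. The lookahead is there so that a short name such as `p` does not also match `<pre`:

```javascript
// Sketch: break before each listed tag, then return the non-empty lines.
function splitHtmlByTags(html, tagNames) {
  // Matches "<name" or "</name" only when the name ends right there.
  var pattern = new RegExp("</?(?:" + tagNames.join("|") + ")(?=[\\s>/])", "gi");
  // "$&" re-inserts the matched text, so the original letter case is kept.
  var pieces = html.replace(pattern, "\n$&").split("\n");
  var lines = [];
  for (var i = 0; i < pieces.length; i++) {
    var line = pieces[i].trim();
    if (line !== "") lines.push(line);   // drop empty pieces
  }
  return lines;
}
```

Because any line breaks already present are kept as split points too, the same call can be used on phones that receive the formatted page and on phones that receive the single-line version.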

Inman · Dec 19, 2012

Thanks Mark. Looks like I will have to use your logic.

I got the downloaded HTML file from one of my users who had this issue, and then I downloaded the same file on my phone. His file contains 47 lines, while my file contains 597 lines.

So one thing is confirmed. There is nothing wrong with TextReader or any other file reading mechanism. The problem is entirely with the way HttpClient downloads a file. On certain phones it downloads files without line breaks on each line, while on most phones it downloads with all line breaks.

I wonder if some parameter or setting could be enabled in HttpClient or HttpRequest to make sure it downloads the file with all line break characters.

bluejay · Dec 19, 2012

Note that some mobile networks use Web Accelerators (also known as PEP - Performance Enhancing Proxy).

So anything that is not part of the HTML spec could get stripped out.

Server web pages can also get optimised on the way to the mobile, e.g. by reordering pages to display text before graphics or re-encoding large graphics for small screens.

So what is sent by the server could be different to what the mobile receives.

bluejay

Inman · Dec 19, 2012

If that is the case, I think it is better to implement something similar to Mark's function.

warwound · Dec 19, 2012

Ahhh, I just re-read the thread...

The problem is not that the WebView won't render HTML that contains no line breaks but that your HTML parsing code expects the web page to contain line breaks!

A solution might be to parse the web page HTML using my XOM library.
(Or the B4A SaxParser).

As long as the HTML is well formed it should work and you'll have a more robust solution.
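
To illustrate why matching on the markup is more robust than matching on lines, here is a small JavaScript sketch. It is not the XOM library or SaxParser, just a regex-based stand-in, and the function name `selectedOptions` is hypothetical. It pulls the selected value out of each `<select>` from the first post, whether or not the page contains line breaks:

```javascript
// Sketch: map each <select> name to the text of its selected <option>.
function selectedOptions(html) {
  var result = {};
  // [\s\S] also matches line breaks, so the layout of the page is irrelevant
  var selectRe = /<select\b([^>]*)>([\s\S]*?)<\/select>/gi;
  var m;
  while ((m = selectRe.exec(html)) !== null) {
    var nameMatch = /\bname\s*=\s*["']?([^\s"'>]+)/i.exec(m[1]);
    var optMatch = /<option\b[^>]*\bselected\b[^>]*>([^<]*)/i.exec(m[2]);
    if (nameMatch && optMatch) {
      result[nameMatch[1]] = optMatch[1].trim();
    }
  }
  return result;
}
```

For the two `<select>` elements shown at the top of the thread this returns `{ nDesign: "5", nFeatures: "5" }` for both the multi-line and the single-line download. A real parser remains the safer choice for pages that are more irregular than this.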

Martin.

Inman · Dec 19, 2012

Somehow I missed your library, Martin. And I was under the impression that B4A SaxParser works only with XML and nothing else.

I managed to implement a function similar to Mark's. But your library is very interesting to me as almost all my apps involve HTML parsing.
