Android Question HTML Unicode Decimal Codes

bocker77

Active Member
Licensed User
I am using the OkHttpUtils2 library to retrieve HTML from a web page. I parse some text out of the page and feed it to the text-to-speech engine. A problem arises when the text contains Unicode decimal codes: the engine says "hash" followed by the number. For now, when I find this happening, I replace the code with something similar in ASCII or with its meaning. For instance, I replace the Unicode &#188, which is the 1/4 symbol, with " one fourth ". Of course I will not be able to catch everything, and I know it would be virtually impossible to replace every character with something that makes sense, so I was wondering if there is any other way to address this problem. Any advice on this issue would be appreciated.
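
For illustration, the decimal codes described above can also be turned back into real characters before the text reaches the text-to-speech engine. The following is only a minimal sketch, not a library routine: DecodeNumericEntities is a hypothetical helper name, it assumes the codes look like &#188 with or without a trailing semicolon, and it only covers code points that fit in a single Char (up to 65535). Whether the engine then pronounces a symbol such as ¼ sensibly depends on the engine.
B4X:
' Hypothetical helper: replaces decimal codes such as &#188 or &#188; with the character they stand for.
Sub DecodeNumericEntities (Text As String) As String
    Dim sb As StringBuilder
    sb.Initialize
    ' Group 1 captures the digits; the trailing semicolon is optional.
    Dim m As Matcher = Regex.Matcher("&#(\d+);?", Text)
    Dim last As Int = 0
    Do While m.Find
        ' Copy the text that sits between the previous match and this one.
        sb.Append(Text.SubString2(last, m.GetStart(0)))
        Dim code As Int = m.Group(1)
        sb.Append(Chr(code))
        last = m.GetEnd(0)
    Loop
    ' Copy whatever follows the last match.
    sb.Append(Text.SubString(last))
    Return sb.ToString
End Sub

A call such as Log(DecodeNumericEntities("1&#188; miles")) should then log the string with the ¼ character in place of the code.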
 

bocker77

Active Member
Licensed User
Erel, thanks.

How does it handle Unicode symbols, as in my example above? Also, I am using regex to remove the HTML code that is left in the text that I am parsing out. Do I use this library in its place? I downloaded both files but am not sure how to add them to my project. Since this is a class, I thought that I could just add it as an existing module, but after looking at the file in Notepad++ it looks as if it is compiled or something like that. I am still coding in B4A. Will this only work with B4X code? Sorry for my ignorance and so many questions.
 

bocker77

Active Member
Licensed User
OK, I figured out that it goes into the additional libraries folder. Now to view some code examples to figure out how to use it. This looks like just what I need. I would be grateful if you could answer my other question about how it handles Unicode symbols.
 

bocker77

Active Member
Licensed User
I am getting this error while trying to use MiniHtmlParser. I am new to this library and its use so there is probably something that I am doing wrong.

Error occurred on line: 276 (MiniHtmlParser)
java.lang.NullPointerException: Attempt to invoke virtual method 'java.lang.Object anywheresoftware.b4a.objects.collections.List.Get(int)' on a null object reference
at com.kangaroosoftware.historicalmarkers.minihtmlparser._gettextfromnode(minihtmlparser.java:203)
at java.lang.reflect.Method.invoke(Native Method)
at anywheresoftware.b4a.shell.Shell.runMethod(Shell.java:732)
at anywheresoftware.b4a.shell.Shell.raiseEventImpl(Shell.java:348)
at anywheresoftware.b4a.shell.Shell.raiseEvent(Shell.java:255)
at java.lang.reflect.Method.invoke(Native Method)
at anywheresoftware.b4a.ShellBA.raiseEvent2(ShellBA.java:144)
at anywheresoftware.b4a.BA$2.run(BA.java:387)
at android.os.Handler.handleCallback(Handler.java:883)
at android.os.Handler.dispatchMessage(Handler.java:100)
at android.os.Looper.loop(Looper.java:241)
at android.app.ActivityThread.main(ActivityThread.java:7617)
at java.lang.reflect.Method.invoke(Native Method)
at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:492)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:941)

The HTML that I am interested in parsing is as follows:
...
<script>
var plainText = document.getElementById('inscription1').innerText;
document.getElementById('inscription1').innerHTML = plainText;
</script>
<span id=speakbutton1 style='cursor:pointer;' onclick="speak(document.getElementById('inscription1').innerHTML,1,'en')"><img src=SpeakIcon.png title='Click to hear the inscription.'></span> &nbsp;The Morgan Raiders arrived in Versailles at 1 p.m., Sunday, July 12, 1863, and proceeded to rob the county treasury and to obtain fresh horses. The Raiders also captured and paroled 300 local men, who had assembled in hopes of defending Versailles. One of Morgan's men robbed the local Masonic Lodge of its jewelry. Morgan, a staunch Mason, ordered the items returned and the thief reprimanded.<br><br>

The Raiders left at 4 p.m. in several directions with orders to converge at present-day St. Paul. Union General Edward H. Hobson and the pursuing Union Cavalry arrived Sunday evening, only four hours behind Morgan. That night the Union forces camped west of Versailles. (Note the Historical Markers on the courthouse lawn.)<br>&nbsp;<br><span class=sectionhead> ...

My code is as follows:
B4X:
    strHTML = File.ReadString(File.DirInternal, "TheHTML.txt")
    Dim root As HtmlNode = parser.Parse(strHTML)
    Dim span As HtmlNode = parser.FindNode(root, "span", parser.CreateHtmlAttribute("id", "speakbutton1"))
    Log(parser.GetTextFromNode(span, 0))
 

bocker77

Active Member
Licensed User
The full HTML source code is attached. I saved it as a text file using Notepad++ because, for some reason, the "Attach files" dialog below didn't show the saved HTML file.
 

Attachments

  • view-source_https___www.hmdb.org_m.txt
    27.1 KB

bocker77

Active Member
Licensed User
I might add that the website has its own clickable control to read the text to you. It looks as if B4A cannot simulate the click, so I decided to parse out the text. After you pointed me to this library, and after looking further into the HTML code, I found that I was getting the wrong text string and not the text that is read. That text might have Unicode in it, whereas the text that is read doesn't appear to. I would rather use this library than parse out the text myself, though. It looks to be not only easier but also less prone to errors if they happen to change the layout of the site's HTML code.
 

Erel

B4X founder
Staff member
Licensed User
This HTML will not work with MiniHtmlParser; however, you can still use its unescaping features:
B4X:
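' Read the saved page source from disk.
Dim s As String = File.ReadString("C:\Users\H\Downloads\view-source_https___www.hmdb.org_m.txt", "")
Dim parser As MiniHtmlParser
parser.Initialize
' Locate the speak-button span, its closing tag, and the next "<div" after it.
Dim i1 As Int = s.IndexOf("<span id=speakbutton1")
Dim i2 As Int = s.IndexOf2("</span>", i1 + 1)
Dim i3 As Int = s.IndexOf2("<div", i2 + 1)
' Take everything between the end of "</span>" (7 characters long) and the next "<div".
Dim text As String = s.SubString2(i2 + 7, i3)
' Convert HTML entities in the extracted text into the characters they stand for.
Log(parser.UnescapeEntities(text))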
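
If the page layout changes, any of the IndexOf calls above returns -1 and SubString2 would then fail. As a hedged sketch only, the same steps can be wrapped in a helper that checks each index first. ExtractInscription is a hypothetical name, not part of MiniHtmlParser, and the Regex.Replace line that swaps leftover tags such as <br> for spaces is an added assumption about what the text-to-speech engine should receive.
B4X:
' Hypothetical helper: returns the unescaped text between the speak-button span
' and the next "<div", or an empty string if any marker is missing.
Sub ExtractInscription (Html As String, Parser As MiniHtmlParser) As String
    Dim closeTag As String = "</span>"
    Dim i1 As Int = Html.IndexOf("<span id=speakbutton1")
    If i1 = -1 Then Return ""
    Dim i2 As Int = Html.IndexOf2(closeTag, i1 + 1)
    If i2 = -1 Then Return ""
    Dim i3 As Int = Html.IndexOf2("<div", i2 + closeTag.Length)
    If i3 = -1 Then Return ""
    Dim raw As String = Html.SubString2(i2 + closeTag.Length, i3)
    ' Assumption: remaining tags are not wanted in the spoken text, so replace them with spaces.
    raw = Regex.Replace("<[^>]*>", raw, " ")
    ' Unescape last, so that entities decoding to "<" or ">" are not mistaken for tags.
    Return Parser.UnescapeEntities(raw).Trim
End Sub

With an initialized parser, it could be called as Log(ExtractInscription(s, parser)); an empty result would suggest that the markers were not found.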
 