Android Question (solved) Converting a single line to proper HTML/XML to be parsed

only1jake

Member
Licensed User
Longtime User
I currently have a working program that scrapes a website for details.
It downloads the page using an HttpJob, then saves certain lines to a file to be parsed with the SaxParser.
It had been working fine until I tried to parse this:
HTML:
<table class="calendarTable" border="0" cellspacing="0" cellpadding="0">
<tr>
<th>Monday</th>
<th>Tuesday</th>
<th>Wednesday </th>
<th>Thursday </th>
<th>Friday </th>
<th>Saturday </th>
<th>Sunday </th>
</tr>
<tr><td class="lastmonth"><div class="calendarTableDay">26</div><p class="">&nbsp;</p><p style="height: 3px;"></p></td><td class="lastmonth"><div class="calendarTableDay">27</div><p class="">&nbsp;</p><p style="height: 3px;"></p></td><td class="lastmonth"><div class="calendarTableDay">28</div><p class="">&nbsp;</p><p style="height: 3px;"></p></td><td class="lastmonth"><div class="calendarTableDay">29</div><p class="">&nbsp;</p><p style="height: 3px;"></p></td><td class="lastmonth"><div class="calendarTableDay">30</div><p class="">&nbsp;</p><p style="height: 3px;"></p></td><td class="lastmonth"><div class="calendarTableDay">31</div><p class="">&nbsp;</p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">1</div><p class="results"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11247/result-detail.aspx">Auckland GRC</a><br/><br/></p><p style="height: 3px;"></p></td></tr><tr><td class="thismonth"><div class="calendarTableDay">2</div><p class="results"><img class="trophy" src="/Images/icons/night_race_icon.png" alt="Night Meeting" width="17" height="17" /><a href="/catch-the-action/11248/result-detail.aspx">Taranaki GRC</a><br/><br/></p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">3</div><p class="results"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11249/result-detail.aspx">Otago GRC</a><br/><br/></p><p style="height: 3px;"></p><p class="results"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11250/result-detail.aspx">Christchurch GRC</a><br/><br/></p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">4</div><p class="results"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" 
height="17" /><a href="/catch-the-action/11251/result-detail.aspx">Palmerston North GRC</a><br/><br/></p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">5</div><p class="results"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11252/result-detail.aspx">Christchurch GRC</a><br/><br/></p><p style="height: 3px;"></p><p class="results"><img class="trophy" src="/Images/icons/night_race_icon.png" alt="Night Meeting" width="17" height="17" /><a href="/catch-the-action/11253/result-detail.aspx">Auckland GRC</a><br/><br/></p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">6</div><p class="results"><img class="trophy" src="/Images/icons/night_race_icon.png" alt="Night Meeting" width="17" height="17" /><a href="/catch-the-action/11255/result-detail.aspx">Christchurch GRC</a><br/><br/></p><p style="height: 3px;"></p><p class="results"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11254/result-detail.aspx">Wanganui GRC</a><br/><br/></p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">7</div><p class="">&nbsp;</p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">8</div><p class="results"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11256/result-detail.aspx">Auckland GRC</a><br/><br/></p><p style="height: 3px;"></p></td></tr><tr><td class="thismonth"><div class="calendarTableDay">9</div><p class="results"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11257/result-detail.aspx">Palmerston North GRC</a><br/><br/></p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">10</div><p 
class="results"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11258/result-detail.aspx">Southland GRC</a><br/><br/></p><p style="height: 3px;"></p><p class="results"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11259/result-detail.aspx">Christchurch GRC</a><br/><br/></p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">11</div><p class="results"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11260/result-detail.aspx">Wanganui GRC</a><br/><br/></p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">12</div><p class="results"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11261/result-detail.aspx">Waikato GRC</a><br/><br/></p><p style="height: 3px;"></p><p class="results"><img class="trophy" src="/Images/icons/night_race_icon.png" alt="Night Meeting" width="17" height="17" /><a href="/catch-the-action/11262/result-detail.aspx">Christchurch GRC</a><br/><br/></p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">13</div><p class="results"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11263/result-detail.aspx">Christchurch GRC</a><br/><br/></p><p style="height: 3px;"></p><p class="results"><img class="trophy" src="/Images/icons/night_race_icon.png" alt="Night Meeting" width="17" height="17" /><a href="/catch-the-action/11264/result-detail.aspx">Wanganui GRC</a><br/><br/></p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">14</div><p class="">&nbsp;</p><p style="height: 3px;"></p></td><td class="thismonth"><div 
class="calendarTableDay">15</div><p class="results"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11265/result-detail.aspx">Auckland GRC</a><br/><br/></p><p style="height: 3px;"></p></td></tr><tr><td class="thismonth"><div class="calendarTableDay">16</div><p class="results"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11266/result-detail.aspx">Palmerston North GRC</a><br/><br/></p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">17</div><p class="results"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11267/result-detail.aspx">Otago GRC</a><br/><br/></p><p style="height: 3px;"></p><p class="results"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11268/result-detail.aspx">Christchurch GRC</a><br/><br/></p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">18</div><p class="results"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11269/result-detail.aspx">Wanganui GRC</a><br/><br/></p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">19</div><p class="results"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11270/result-detail.aspx">Waikato GRC</a><br/><br/></p><p style="height: 3px;"></p><p class="results"><img class="trophy" src="/Images/icons/night_race_icon.png" alt="Night Meeting" width="17" height="17" /><a href="/catch-the-action/11271/result-detail.aspx">Christchurch GRC</a><br/><br/></p><p style="height: 3px;"></p></td><td class="thismonth"><div 
class="calendarTableDay">20</div><p class="results"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11272/result-detail.aspx">Christchurch GRC</a><br/><br/></p><p style="height: 3px;"></p><p class="fields"><img class="trophy" src="/Images/icons/night_race_icon.png" alt="Night Meeting" width="17" height="17" /><a href="/catch-the-action/11273/field-detail.aspx">Wanganui GRC</a><br/><br/></p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">21</div><p class="">&nbsp;</p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">22</div><p class="fields"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11274/field-detail.aspx">Auckland GRC</a><br/><br/></p><p style="height: 3px;"></p><p class="fields"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11275/field-detail.aspx">Ashburton GRC</a><br/><br/></p><p style="height: 3px;"></p></td></tr><tr><td class="thismonth"><div class="calendarTableDay">23</div><p class="fields"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11276/field-detail.aspx">Palmerston North GRC</a><br/><br/></p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">24</div><p class="fields"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11277/field-detail.aspx">Southland GRC</a><br/><br/></p><p style="height: 3px;"></p><p class="fields"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11278/field-detail.aspx">Christchurch GRC</a><br/><br/></p><p style="height: 3px;"></p></td><td 
class="thismonth"><div class="calendarTableDay">25</div><p class="fields"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11279/field-detail.aspx">Wanganui GRC</a><br/><br/></p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">26</div><p class="schedule"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11267/meeting-schedule.aspx">Waikato GRC</a><br/><br/></p><p style="height: 3px;"></p><p class="schedule"><img class="trophy" src="/Images/icons/night_race_icon.png" alt="Night Meeting" width="17" height="17" /><a href="/catch-the-action/11268/meeting-schedule.aspx">Christchurch GRC</a><br/><br/></p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">27</div><p class="schedule"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11269/meeting-schedule.aspx">Christchurch GRC</a><br/><br/></p><p style="height: 3px;"></p><p class="schedule"><img class="trophy" src="/Images/icons/night_race_icon.png" alt="Night Meeting" width="17" height="17" /><a href="/catch-the-action/11270/meeting-schedule.aspx">Wanganui GRC</a><br/><br/></p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">28</div><p class="">&nbsp;</p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">29</div><p class="schedule"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11271/meeting-schedule.aspx">Auckland GRC</a><br/><br/></p><p style="height: 3px;"></p></td></tr><tr><td class="thismonth"><div class="calendarTableDay">30</div><p class="schedule"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a 
href="/catch-the-action/11272/meeting-schedule.aspx">Palmerston North GRC</a><br/><br/></p><p style="height: 3px;"></p></td><td class="lastmonth"><div class="calendarTableDay">1</div><p class="">&nbsp;</p><p style="height: 3px;"></p></td><td class="lastmonth"><div class="calendarTableDay">2</div><p class="">&nbsp;</p><p style="height: 3px;"></p></td><td class="lastmonth"><div class="calendarTableDay">3</div><p class="">&nbsp;</p><p style="height: 3px;"></p></td><td class="lastmonth"><div class="calendarTableDay">4</div><p class="">&nbsp;</p><p style="height: 3px;"></p></td><td class="lastmonth"><div class="calendarTableDay">5</div><p class="">&nbsp;</p><p style="height: 3px;"></p></td><td class="lastmonth"><div class="calendarTableDay">6</div><p class="">&nbsp;</p><p style="height: 3px;"></p>
</table>

I am having problems because I want to parse the line starting with "<tr><td class="lastmonth"><div class="calendarTableDay"". This line continues all the way to the final "</p>" before the last line "</table>"

I successfully write it to the file, but when I try to parse it using "parser.Parse(File.OpenInput(File.DirRootExternal, "index.html"), "parser")" I get the error below, which indicates a problem with the XML/HTML.

Any tips on parsing this very long line, or on converting it into a proper layout to be parsed, would be helpful. Thanks!


The error I receive:

B4X:
org.apache.harmony.xml.ExpatParser$ParseException: At line 1, column 84: undefined entity


    at org.apache.harmony.xml.ExpatParser.parseFragment(ExpatParser.java:515)
    at org.apache.harmony.xml.ExpatParser.parseDocument(ExpatParser.java:474)
    at org.apache.harmony.xml.ExpatReader.parse(ExpatReader.java:321)
    at org.apache.harmony.xml.ExpatReader.parse(ExpatReader.java:279)
    at anywheresoftware.b4a.objects.SaxParser.parse(SaxParser.java:80)
    at anywheresoftware.b4a.objects.SaxParser.Parse(SaxParser.java:73)
    at b4a.jtidy.main._jobdone(main.java:438)
    at java.lang.reflect.Method.invokeNative(Native Method)
    at java.lang.reflect.Method.invoke(Method.java:511)
    at anywheresoftware.b4a.BA.raiseEvent2(BA.java:174)
    at anywheresoftware.b4a.keywords.Common$5.run(Common.java:957)
    at android.os.Handler.handleCallback(Handler.java:725)
    at android.os.Handler.dispatchMessage(Handler.java:92)
    at android.os.Looper.loop(Looper.java:213)
    at android.app.ActivityThread.main(ActivityThread.java:5092)
    at java.lang.reflect.Method.invokeNative(Native Method)
    at java.lang.reflect.Method.invoke(Method.java:511)
    at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:797)
    at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:564)
    at dalvik.system.NativeStart.main(Native Method)
org.apache.harmony.xml.ExpatParser$ParseException: At line 1, column 84: undefined entity
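
A note on the error above: XML predefines only five named entities (&amp;, &lt;, &gt;, &quot; and &apos;), so a strict XML parser treats &nbsp; as undefined unless a DTD declares it. Column 84 of the saved line falls right around the first &nbsp;, which makes it the most likely culprit. The following is a minimal sketch in plain Java SAX rather than B4A, and it is not code from this thread. The class name NbspFix and its methods are made up for illustration, and it assumes the fragment is otherwise well-formed. It swaps &nbsp; for the numeric reference &#160; before parsing:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only.
class NbspFix {
    // Replace the HTML-only entity with its numeric form, which XML accepts.
    static String fixEntities(String html) {
        return html.replace("&nbsp;", "&#160;");
    }

    // Returns true if a strict SAX parser accepts the text as well-formed XML.
    static boolean parses(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // Covers parse errors such as an undeclared entity.
            return false;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // One cell copied from the calendar row, closed so that it is well-formed.
        String row = "<tr><td class=\"lastmonth\"><div class=\"calendarTableDay\">26</div>"
                   + "<p class=\"\">&nbsp;</p><p style=\"height: 3px;\"></p></td></tr>";
        System.out.println("raw:   " + parses(row));
        System.out.println("fixed: " + parses(fixEntities(row)));
    }
}
```

In B4A the same idea would mean reading the saved text, replacing the entity in the string, and only then handing it to the parser. Note that the fragment as posted also ends without closing its last td and tr, so the entity may not be the only thing a strict parser objects to.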
 

only1jake

Ahhh, of course. That didn't occur to me, Erel.

Soooo, in the end I used JTidy to convert the single line into more proper XML. However, the output came out with the code and tags split across multiple lines, and not in the right places. I then used a reader and writer to pick out the lines and details I wanted, with If statements to put them into the right format for the parser. Now I can use the parser to collect the start and end elements I require. It took another 100 lines (lol).
Thanks again :cool:
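
For readers wondering what "collecting the start and end elements" can look like once the markup is well-formed, here is a minimal sketch in plain Java SAX. B4A's SaxParser raises comparable StartElement and EndElement events, but this is not the code used in this thread. The class name MeetingCollector and the "day | club | link" output format are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: remembers the current day number and records each link under it.
class MeetingCollector extends DefaultHandler {
    final List<String> meetings = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private String day = "";
    private String href = null;
    private boolean inDay = false;

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        text.setLength(0); // start collecting text for the new element
        if (qName.equals("div") && "calendarTableDay".equals(atts.getValue("class"))) {
            inDay = true;
        } else if (qName.equals("a")) {
            href = atts.getValue("href");
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (qName.equals("div") && inDay) {
            day = text.toString().trim();
            inDay = false;
        } else if (qName.equals("a") && href != null) {
            meetings.add(day + " | " + text.toString().trim() + " | " + href);
            href = null;
        }
    }

    // Parses a well-formed fragment and returns one entry per link found.
    static List<String> collect(String xml) {
        try {
            MeetingCollector handler = new MeetingCollector();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.meetings;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // One cell from the calendar above, wrapped so that it is well-formed.
        String row = "<tr><td class=\"thismonth\"><div class=\"calendarTableDay\">1</div>"
            + "<p class=\"results\"><img class=\"trophy\" src=\"/Images/icons/day_race_icon.png\""
            + " alt=\"Day Meeting\" width=\"17\" height=\"17\" />"
            + "<a href=\"/catch-the-action/11247/result-detail.aspx\">Auckland GRC</a>"
            + "<br/><br/></p><p style=\"height: 3px;\"></p></td></tr>";
        for (String m : collect(row)) {
            System.out.println(m); // 1 | Auckland GRC | /catch-the-action/11247/result-detail.aspx
        }
    }
}
```

Keeping the day number in a field works because each calendarTableDay div comes before the links in its cell, so every link is attributed to the most recent day seen.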
 