Parsing html table to extract values

aletrotta

New Member
Licensed User
Longtime User
Hi Everyone,

I'm looking to parse an html table from a page to extract some values.
Page: .:: Banco De La Nación Argentina ::.

Sadly, the whole table is written on a single line, so I can't use line-by-line code like the FlickrViewer example, and I don't know the power or the limits of the Regex functions.

I've also found some PHP code that does exactly what I would like to do in B4A, using the preg_match function: Awakenings - Obtener cotización del dólar y euro

Does a function or library like preg_match exist in B4A?
Any help you can give me would be much appreciated.

Thanks in advance,
Alejandro.
 

Erel

B4X founder
Staff member
Licensed User
Longtime User
Here:
B4X:
Sub ParseTable(s As String)
   'Sample rows this pattern is written for:
   '<td class="linksazul" align="left">Coronas Suecas</td>
   '<td class="linksazul" align="center">63.4090</td>
   '<td class="linksazul" align="center">64.4845</td>
   '<td class="linksazul" align="center"></td>
   Dim m As Matcher
   'Each ~ is replaced with \" so the pattern matches a literal quote.
   'The three groups capture the text of three consecutive cells.
   m = Regex.Matcher( _
"<td class=~linksazul~ [^>]+>([^<]+)</td>[^>]+>([^<]+)</td>[^>]+>([^<]+)</td>".Replace("~", "\" & QUOTE), s)
   Do While m.Find
      Log("col1=" & m.Group(1))
      Log("col2=" & m.Group(2))
      Log("col3=" & m.Group(3))
      Log("*********************")
   Loop
End Sub
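
A quick way to try the sub above without downloading the page is to feed it one row built from the sample cells in the comments. This is only a sketch: TestParseTable is a hypothetical name, and the ~ placeholder is the same trick used in the pattern, here replaced with a plain quote.
B4X:
Sub TestParseTable
   'Hypothetical test helper: one table row, written on a single line like the real page.
   Dim sample As String = "<td class=~linksazul~ align=~left~>Coronas Suecas</td>" & _
      "<td class=~linksazul~ align=~center~>63.4090</td>" & _
      "<td class=~linksazul~ align=~center~>64.4845</td>"
   sample = sample.Replace("~", QUOTE)
   'Should log the currency name as col1 and the two rates as col2 and col3.
   ParseTable(sample)
End Sub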
 

jalle007

Active Member
Licensed User
Longtime User
I have a similar problem here. I'm trying to parse the table from this link,
SIA.ba - mobile, using Regex, but with no success. I tried this:

B4X:
Sub Parse(html As String)
   'Try to remove line breaks
   html = html.Replace(CRLF, "")

   'table_pattern is defined elsewhere (not shown in this post)
   Dim m As Matcher
   m = Regex.Matcher(table_pattern, html)
   Do While m.Find
      Log(m.GroupCount)
      Log(m.Match)
      Log("col1=" & m.Group(1))
   Loop

   'The code above can't find the table, although there is just a single table in the page.

   'But it works when using this sample table:
   'html = "asdasasd<table>something</table>asdadasd"
End Sub
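
One possible cause, sketched under an assumption: the post does not show table_pattern, but if it is something like <table>(.*)</table>, it can match the one-line sample and still fail on the real page. B4A's Regex uses Java regular expressions, where "." does not match line terminators by default, and B4A's CRLF constant is only Chr(10), so Replace(CRLF, "") leaves any Chr(13) characters in place. The inline flag (?s) lets "." match line breaks too. FindTable is a hypothetical helper name.
B4X:
Sub FindTable(html As String) As String
   '(?s) turns on DOTALL so "." also matches line breaks.
   '.*? is lazy, so the match stops at the first </table>.
   Dim m As Matcher = Regex.Matcher("(?s)<table[^>]*>(.*?)</table>", html)
   If m.Find Then Return m.Group(1)
   'No table found
   Return ""
End Sub

With this pattern the line breaks do not need to be removed first.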
 

jalle007

Active Member
Licensed User
Longtime User
Hi Erel,

Yes, there is a table on that page with the city names. I don't know why you can't find it.
SIA.ba - mobile

Here is the page source:

B4X:
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
   <!DOCTYPE html PUBLIC"-//W3C//DTD XHTML 1.0 Strict//EN""http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
   <html xmlns="http://www.w3.org/1999/xhtml">
   <head>
   <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

   <meta name="viewport" content="width=device-width,initial-scale=1.0" />
   <meta name="format-detection" content="telephone=no">

   <title>SIA.ba - mobile</title>

   <style type="text/css">
      body{padding:0;margin:0;font-family: Helvetica,sans-serif;background:#F2F2EF;}
      hr { border:0; width:100%; height:1px; background-color:gray}

      .header, .footer {
      width: 100%;
      }

      img{border:0}

      .sidebar {
      display: none;
      }
#desnomenu{margin:0;padding:0;}
#desnomenu ul{
list-style-type: none;
margin-left:5;
padding:0;
}

#desnomenu li{
   margin:0 2px 0 0;
   padding:0;
   line-height: 1.5em;
   clear:left;
   font-size:15px;
}

#desnomenu li.m{
   margin:0 2px 0 0;
   padding:0;
   line-height: 1.5em;
   clear:left;
   font-size:14px;
}

#desnomenu a{
float:left;
color: #333;
margin:6px 0 0 0px;
text-decoration:none;
letter-spacing: 1px;
width:100%;
height:44px;
border-bottom:1px solid #c0c0c0;
}


#desnomenu li.m a{
height:auto;
}

#desnomenu span b{
background:#FF9933;
color:white;
padding:2px;
-moz-border-radius: 4px; 
font-size:10px;
}

#desnomenu a:hover{
color:#2E8AA2;
}

   </style>
   
   </head>
   
   <body style="font-family:Arial, Helvetica, sans-serif; font-size:12px">
   
      <div style="width:100%;height:10px;overflow:hidden;background:#22303C;border-bottom:3px solid #1C7F99;"></div>
      <div style="padding:10px;background:white;">
         <a href="index.php?lang=bos"><img alt="Početna" src="../logo.gif"></a>
         <br><br>
         
                              <a href="/mobile/letovi.php?vrsta=odlasci&lang=eng" style="float:right;color:#333;">English</a>
               
      </div>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
<div id="desnomenu">
      
      <table>
               <tr>
            <td><a href="letovi.php?vrsta=odlasci&dest=Ancona&lang=bos">Ancona</a></td>
         </tr>
                  <tr>
            <td><a href="letovi.php?vrsta=odlasci&dest=Beograd&lang=bos">Beograd</a></td>
         </tr>
                  <tr>
            <td><a href="letovi.php?vrsta=odlasci&dest=Istanbul&lang=bos">Istanbul</a></td>
         </tr>
                  <tr>
            <td><a href="letovi.php?vrsta=odlasci&dest=Ljubljana&lang=bos">Ljubljana</a></td>
         </tr>
                  <tr>
            <td><a href="letovi.php?vrsta=odlasci&dest=Munich&lang=bos">Munich</a></td>
         </tr>
                  <tr>
            <td><a href="letovi.php?vrsta=odlasci&dest=Vienna&lang=bos">Vienna</a></td>
         </tr>
                  <tr>
            <td><a href="letovi.php?vrsta=odlasci&dest=Zagreb&lang=bos">Zagreb</a></td>
         </tr>
                  <tr>
            <td><a href="letovi.php?vrsta=odlasci&dest=Zurich&lang=bos">Zurich</a></td>
         </tr>
               </table>
      
      </div>

   <br clear="all">


   <div style="display:block; width:100%; text-align:center;background:#22303C;border-top:3px solid #1C7F99;color:white;margin-top:10px;padding:5px 0 5px 0;">
      <a href="index.php?lang=bos" style="color:white;">Početna</a>
      <br><br>
      <span style="font-style:italic; font-size:10px">
         (C)2013 Međunarodni Aerodrom Sarajevo <br> Sektor za informatičke i komunikacione tehnologije      </span>
   </div>

   </body>
   
</html>
 

Erel

B4X founder
Staff member
Licensed User
Longtime User
Here is a better approach based on the new JTidy library:
B4X:
Sub Process_Globals
   Dim sax As SaxParser
   Type ParsedItem(Link As String, Country As String)
   Dim items As List
End Sub
Sub Globals

End Sub

Sub Activity_Create(FirstTime As Boolean)
   Dim t As Tidy
   t.Initialize
   t.Parse(File.OpenInput(File.DirAssets, "index.html"), File.DirInternal, "temp.xml")
   sax.Initialize
   Dim In As InputStream = File.OpenInput(File.DirInternal, "temp.xml")
   items.Initialize
   sax.Parse(In, "sax")
   In.Close
   Log(items)
End Sub

Sub sax_StartElement (Uri As String, Name As String, Attributes As Attributes)
   If sax.Parents.IndexOf("table") <> -1 AND Name = "a" Then
      Dim pi As ParsedItem
      pi.Initialize
      pi.Link = Attributes.GetValue2("", "href")
      items.Add(pi)
   End If
End Sub

Sub sax_EndElement (Uri As String, Name As String, Text As StringBuilder)
   If sax.Parents.IndexOf("table") <> -1 AND Name = "a" Then
      Dim pi As ParsedItem = items.Get(items.Size - 1) 'get the last item
      pi.Country = text
   End If
End Sub

First we create a temporary XML file from the HTML and then we parse it.
 

jalle007

Active Member
Licensed User
Longtime User
Thank you Erel, this works fine.

I have a sax event in which the Flights list is populated with the flight details:

B4X:
Sub saxFlights_EndElement (Uri As String, Name As String, Text As StringBuilder)
   If saxFlights.Parents.IndexOf("tr") > -1 AND Name = "td" Then
      Select Case c
         Case 0: flight.Grad = Text
         Case 2: flight.BrojLeta = Text
         Case 4: flight.Kompanija = Text
         Case 6: flight.TipAviona = Text
         Case 8: flight.Vrijeme = Text
         Case 10: flight.Status = Text
      End Select

      c = c + 1
      If c = 12 Then
         c = 0
         Flights.Add(flight)
         flight.Initialize
         counter = counter + 1
      End If
   End If
End Sub

On the first pass everything is fine and Flights.Add(flight) adds the new flight:
B4X:
(ArrayList) [[BrojLeta=SOP 4121, Grad=Beograd, Vrijeme=16:35
, Kompanija=SOLINAIR LTD, Status=, TipAviona=SF34
, IsInitialized=true]]

The problem is that on the next pass, when another flight should be added, the old one is duplicated:
B4X:
(ArrayList) [[BrojLeta=SOP 4121, Grad=Beograd, Vrijeme=16:35
, Kompanija=SOLINAIR LTD, Status=, TipAviona=SF34
, IsInitialized=true], [BrojLeta=SOP 4121, Grad=Beograd, Vrijeme=16:35
, Kompanija=SOLINAIR LTD, Status=, TipAviona=SF34
, IsInitialized=true]]
So here I have two duplicate flights instead of a new one.

The question is why this part of the code
B4X:
Flights.Add(flight) 
flight.Initialize
does not add a new object to the list, but overwrites the old one?
 

jalle007

Active Member
Licensed User
Longtime User
B4X:
Sub Process_Globals
   'These global variables will be declared once when the application starts.
   'These variables can be accessed from all modules.
   Dim c As Int = 0
   Dim counter As Int = 0
   Type FlightData(What As String, Value As String)
   Type FlightInfo(Grad As String, BrojLeta As String, Kompanija As String, TipAviona As String, Vrijeme As String, Status As String)
   Type CityInfo(city As String, Url As String)

   Dim city1 As CityInfo: city1.Initialize
   Dim flight As FlightInfo: flight.Initialize

   Dim Cities, Flights As List
   Flights.Initialize: Cities.Initialize
End Sub

I have it initialized here. And yes, that is what is happening: the new "flight" object is not initialized.

The only thing I don't understand is whether I should declare FlightInfo as a Type or as a Class.
 

jalle007

Active Member
Licensed User
Longtime User
You were right, agraham.
I just needed to re-Dim flight:
B4X:
   Dim flight As FlightInfo: flight.Initialize
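
A minimal sketch of why this helps: a List stores a reference to the object, not a copy. After Flights.Add(flight), the list entry and the flight variable still refer to the same object, so adding it again stores the same instance twice. Using Dim on the global creates a new instance instead. StoreFlight is a hypothetical helper name; flight, Flights and FlightInfo are the globals from the Process_Globals post above.
B4X:
'Hypothetical helper: stores the finished row and prepares a fresh object.
Sub StoreFlight
   Flights.Add(flight)        'the list keeps a reference to this object
   Dim flight As FlightInfo   're-Dim the global: flight now refers to a new instance
   flight.Initialize          'initialize the fields of the new instance
End Sub

In saxFlights_EndElement this would take the place of the Flights.Add(flight) / flight.Initialize pair inside the If c = 12 block.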

Now another problem has appeared.
Since I am running multiple downloads with HttpJob,
I get a Java exception somewhere in the HttpJob module:
B4X:
Sub GetInputStream As InputStream
   Dim In As InputStream
   In = File.OpenInput(HttpUtils2Service.TempFolder, taskId)
   Return In
End Sub

B4X:
httpjob_getinputstream (B4A line: 124)


In = File.OpenInput(HttpUtils2Service.TempFolder, taskId)
java.io.FileNotFoundException: /data/data/b4a.example/cache/3: open failed: ENOENT (No such file or directory)


   at libcore.io.IoBridge.open(IoBridge.java:416)
   at java.io.FileInputStream.<init>(FileInputStream.java:78)
   at anywheresoftware.b4a.objects.streams.File.OpenInput(File.java:197)
   at b4a.example.httpjob._getinputstream(httpjob.java:209)
   at b4a.example.main._jobdone(main.java:607)
   at java.lang.reflect.Method.invokeNative(Native Method)
   at java.lang.reflect.Method.invoke(Method.java:511)
   at anywheresoftware.b4a.BA.raiseEvent2(BA.java:167)
   at anywheresoftware.b4a.keywords.Common$4.run(Common.java:885)
   at android.os.Handler.handleCallback(Handler.java:725)
   at android.os.Handler.dispatchMessage(Handler.java:92)
   at android.os.Looper.loop(Looper.java:137)
   at android.app.ActivityThread.main(ActivityThread.java:5191)
   at java.lang.reflect.Method.invokeNative(Native Method)
   at java.lang.reflect.Method.invoke(Method.java:511)
   at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:799)
   at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:566)
   at dalvik.system.NativeStart.main(Native Method)
Caused by: libcore.io.ErrnoException: open failed: ENOENT (No such file or directory)
   at libcore.io.Posix.open(Native Method)
   at libcore.io.BlockGuardOs.open(BlockGuardOs.java:110)
   at libcore.io.IoBridge.open(IoBridge.java:400)
   ... 17 more
java.io.FileNotFoundException: /data/data/b4a.example/cache/3: open failed: ENOENT (No such file or directory)
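
The thread does not say what caused this exception. One thing worth ruling out, as a sketch only: GetInputStream opens the temporary file that holds the job's response, and that file will not be there if the job failed or if Release (which deletes it) was already called for that job. The JobDone layout below is an assumption about the calling code, which is not shown in the post.
B4X:
Sub JobDone (Job As HttpJob)
   If Job.Success Then
      'Read the response before calling Release.
      Dim In As InputStream = Job.GetInputStream
      '... parse the stream here ...
      In.Close
   Else
      'A failed job has no response file to open.
      Log("Error in " & Job.JobName & ": " & Job.ErrorMessage)
   End If
   'Release once, after the data has been read.
   Job.Release
End Sub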
 