Android Question Help with RegEx and

ivan.tellez

Active Member
Licensed User
Longtime User
Hi, assuming a table with this data:

B4X:
<tr data-ri="0" class="ui-widget-content ui-datatable-even" role="row">
        <td role="gridcell"><span style="font-weight: bold;">XXXX:</span></td>
        <td role="gridcell" style="text-align:left;">YYYY</td></tr>
    
    <tr data-ri="1" class="ui-widget-content ui-datatable-odd" role="row">
        <td role="gridcell"><span style="font-weight: bold;">XXXX:</span></td>
        <td role="gridcell" style="text-align:left;">YYYY</td></tr>
        
    <tr data-ri="2" class="ui-widget-content ui-datatable-even" role="row">
        <td role="gridcell"><span style="font-weight: bold;">XXXX:</span></td>
        <td role="gridcell" style="text-align:left;">YYYY</td></tr>

How can I get two capturing groups, one with XXXX and one with YYYY, in the same match?

So far I have only partial success:

B4X:
Dim Exp As String = $"<span style="font-weight: bold[^>]+>([^<]+)|<td role="gridcell"[^>]+>([^<]+)"$
Dim m As Matcher
m = Regex.Matcher(Exp, Respuesta)

Do While m.Find
   Log("G1=" & m.Group(1))
   Log("G2=" & m.Group(2))
   Log("*********************")
Loop

Shows:

G1=XXXX :
G2=null
*********************
G1=null
G2=YYYY
*********************
G1=XXXX :
G2=null
*********************
G1=null
G2=YYYY
*********************
G1=XXXX :
G2=null
*********************
G1=null
G2=YYYY
*********************


(XXXX and YYYY are just placeholders. The real values are unique, but they always come in pairs.)

Any idea how to make it show this instead:

G1=XXXX :
G2=YYYY
*********************
G1=XXXX :
G2=YYYY
*********************
G1=XXXX :
G2=YYYY
*********************

Thanks
 

Roycefer

Well-Known Member
Licensed User
Longtime User
First remove the line breaks from your HTML string:
B4X:
Dim deflatedString As String = htmlString.Replace(CRLF,"")
Then try this pattern:
B4X:
Dim pattern As String = "<span.*?>(.*?)</span>.*?<td.*?>(.*?)</td>"
Dim mchr As Matcher = Regex.Matcher(pattern, deflatedString)
Do While mchr.Find
   Log("G1=" & mchr.Group(1))
   Log("G2=" & mchr.Group(2))
   Log("*********************")
Loop
I have only tested this pattern in JavaScript, not B4A, but I don't think the differences between the regex engines should break it. The trick is the question marks: a ? placed after a quantifier makes that quantifier lazy (not greedy), so it matches as few characters as possible.

For example, .*?> stops at the first > it reaches, even though the . could also match a >. Without the question mark, a greedy .* would run on to the last > it can find.
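
To make the lazy versus greedy difference concrete, here is a minimal sketch. The sample string and the variable names are made up for illustration, and it assumes B4A's Regex behaves like the Java regex engine it wraps:

```b4x
' Hypothetical sample: two label/value pairs on a single line.
Dim sample As String = "<span>A:</span><td>1</td><span>B:</span><td>2</td>"

' Lazy: each .*? stops as soon as the rest of the pattern can match.
Dim lazy As Matcher = Regex.Matcher("<span.*?>(.*?)</span>", sample)
If lazy.Find Then Log("lazy G1=" & lazy.Group(1))       ' logs: lazy G1=A:

' Greedy: the first .* runs to the end of the text, then backtracks
' to the last > that still lets the rest of the pattern match.
Dim greedy As Matcher = Regex.Matcher("<span.*>(.*)</span>", sample)
If greedy.Find Then Log("greedy G1=" & greedy.Group(1)) ' logs: greedy G1=B:
```

The greedy version skips the whole first pair, which is why the lazy form is the one to use when the same tags repeat in every row.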
 

Erel

B4X founder
Staff member
Licensed User
Longtime User
B4X:
Dim tidy As Tidy
tidy.Initialize
Dim s As String = $"<tr data-ri="0" class="ui-widget-content ui-datatable-even" role="row">
    <td role="gridcell"><span style="font-weight: bold;">XXXX:</span></td>
    <td role="gridcell" style="text-align:left;">YYYY</td></tr>

<tr data-ri="1" class="ui-widget-content ui-datatable-odd" role="row">
    <td role="gridcell"><span style="font-weight: bold;">XXXX:</span></td>
    <td role="gridcell" style="text-align:left;">YYYY</td></tr>
 
<tr data-ri="2" class="ui-widget-content ui-datatable-even" role="row">
    <td role="gridcell"><span style="font-weight: bold;">XXXX:</span></td>
    <td role="gridcell" style="text-align:left;">YYYY</td></tr>"$
Dim in As InputStream
Dim b() As Byte = s.GetBytes("utf8")
in.InitializeFromBytesArray(b, 0, b.Length)
tidy.Parse(in, File.DirInternal, "1.xml")
Dim xm As Xml2Map
xm.Initialize
Dim m As Map = xm.Parse(File.ReadString(File.DirInternal, "1.xml"))
'   Dim jg As JSONGenerator 'convert to a nice string to better understand the
'   jg.Initialize(m)
'   Log(jg.ToPrettyString(4))

The JSON output is:
B4X:
{
    "html": {
        "head": {
            "meta": {
                "Attributes": {
                    "name": "generator",
                    "content": "HTML Tidy for Java (vers. 2009-12-01), see jtidy.sourceforge.net"
                },
                "Text": ""
            },
            "title": ""
        },
        "body": {
            "table": {
                "tr": [
                    {
                        "td": [
                            {
                                "Attributes": {
                                    "role": "gridcell"
                                },
                                "span": {
                                    "Attributes": {
                                        "style": "font-weight: bold;"
                                    },
                                    "Text": "XXXX:"
                                }
                            },
                            {
                                "Attributes": {
                                    "role": "gridcell",
                                    "style": "text-align:left;"
                                },
                                "Text": "YYYY"
                            }
                        ],
                        "Attributes": {
                            "role": "row",
                            "class": "ui-widget-content ui-datatable-even",
                            "data-ri": "0"
                        }
                    },
                    {
                        "td": [
                            {
                                "Attributes": {
                                    "role": "gridcell"
                                },
                                "span": {
                                    "Attributes": {
                                        "style": "font-weight: bold;"
                                    },
                                    "Text": "XXXX:"
                                }
                            },
                            {
                                "Attributes": {
                                    "role": "gridcell",
                                    "style": "text-align:left;"
                                },
                                "Text": "YYYY"
                            }
                        ],
                        "Attributes": {
                            "role": "row",
                            "class": "ui-widget-content ui-datatable-odd",
                            "data-ri": "1"
                        }
                    },
                    {
                        "td": [
                            {
                                "Attributes": {
                                    "role": "gridcell"
                                },
                                "span": {
                                    "Attributes": {
                                        "style": "font-weight: bold;"
                                    },
                                    "Text": "XXXX:"
                                }
                            },
                            {
                                "Attributes": {
                                    "role": "gridcell",
                                    "style": "text-align:left;"
                                },
                                "Text": "YYYY"
                            }
                        ],
                        "Attributes": {
                            "role": "row",
                            "class": "ui-widget-content ui-datatable-even",
                            "data-ri": "2"
                        }
                    }
                ]
            }
        }
    }
}

It shouldn't be difficult to get all the values from here.
And you can use this tool to help you: https://b4x.com:51041/json/index.html
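
As a rough sketch of that last step, the map can be walked by following the same keys that appear in the JSON output. It assumes m is the Map returned by xm.Parse above and that the document has the exact shape shown; the other variable names are only for illustration:

```b4x
' Follow the keys from the JSON output: html -> body -> table -> tr.
Dim html As Map = m.Get("html")
Dim body As Map = html.Get("body")
Dim table As Map = body.Get("table")
' Assumption: there are several rows, so "tr" holds a List.
Dim rows As List = table.Get("tr")
For Each row As Map In rows
    Dim cells As List = row.Get("td")
    Dim labelCell As Map = cells.Get(0)      ' first td wraps the bold span
    Dim span As Map = labelCell.Get("span")
    Dim valueCell As Map = cells.Get(1)      ' second td holds the value
    Log("G1=" & span.Get("Text"))
    Log("G2=" & valueCell.Get("Text"))
    Log("*********************")
Next
```

If the table could ever contain a single row, it is worth checking whether "tr" still comes back as a List before looping, since a lone element may be returned as a Map instead.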
 

ivan.tellez

Roycefer said:
Then try this pattern:
B4X:
Dim pattern As String = "<span.*?>(.*?)</span>.*?<td.*?>(.*?)</td>"
I have only tested this pattern in JavaScript, not B4A, but I don't think the differences in the Regex engines should break this.

Wow, that's great!

I just made a few small adjustments:

  • To use it in B4A, I escaped the /
  • I added a colon after the first group so it is left out of the captured text
  • Before the first TR there was another <SPAN>, so I had to add more text at the start of the pattern

B4X:
Dim Exp As String = $""gridcell"><span.*?>(.*?):<\/span>.*?<td.*?>(.*?)<\/td"$
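
For reference, this is roughly how the adjusted pattern fits into the loop from the first post. It is only a sketch: Respuesta is assumed to hold the HTML string, and the (?s) prefix is an addition that is not part of the pattern above. (?s) is the Java regex inline flag that lets . match line breaks, so with it the line breaks should not need to be stripped first (note that CRLF in B4X is only Chr(10)). The / is left unescaped here because the Java regex engine does not treat it as special; escaping it as above is harmless.

```b4x
' Assumption: Respuesta holds the HTML shown in the first post.
' (?s) makes . match line breaks as well (an addition, see above).
Dim Exp As String = $"(?s)"gridcell"><span.*?>(.*?):</span>.*?<td.*?>(.*?)</td"$
Dim m As Matcher = Regex.Matcher(Exp, Respuesta)
Do While m.Find
    Log("G1=" & m.Group(1))   ' label, without the trailing colon
    Log("G2=" & m.Group(2))   ' value from the cell that follows
    Log("*********************")
Loop
```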


Many thanks and best regards.
 