Android Tutorial [B4X] Regular expressions (RegEx) tutorial

Erel

Administrator
Staff member
Licensed User
Regular expressions are very powerful and make complicated parsing challenges much easier.
This short tutorial describes the usage of regular expressions in Basic4android.
If you are not familiar with regular expressions you can find many good tutorials online. I recommend starting with this one: Regular Expression Tutorial - Learn How to Use Regular Expressions

Basic4android uses the Java regular expression engine. See this page for specific nuances related to this engine: Pattern (Java Platform SE 6)

Regular expression methods in Basic4android start with the predefined object named Regex. You can write Regex followed by a dot to see the available methods.

All methods accept a pattern string. This is the regular expression pattern. Note that the compiled patterns are cached internally, so there is no performance loss when using the same pattern multiple times.

For each method there are two variants. The difference between the variants is that the second one receives an 'options' integer that affects the engine behavior. For now there are two options, CASE_INSENSITIVE and MULTILINE. CASE_INSENSITIVE makes the pattern matching case insensitive. MULTILINE changes the string anchors ^ and $ to match the beginning and end of each line instead of the whole string.
Both options can be combined by calling Bit.Or(Regex.MULTILINE, Regex.CASE_INSENSITIVE).
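
As a rough sketch of how the options might be passed: the sample strings are made up for illustration, and the calls assume the signatures IsMatch2(Pattern, Options, Text) and Matcher2(Pattern, Options, Text).
B4X:
'Case insensitive match of the whole string.
Log(Regex.IsMatch2("hello", Regex.CASE_INSENSITIVE, "HELLO")) 'True

'Both options combined: ^ and $ match at every line, ignoring case.
Dim sample As String
sample = "ABC" & CRLF & "abc"
Dim options As Int
options = Bit.Or(Regex.MULTILINE, Regex.CASE_INSENSITIVE)
Dim m As Matcher
m = Regex.Matcher2("^abc$", options, sample)
Do While m.Find = True
    Log(m.Match) 'ABC, then abc
Loop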

Matching the whole string
IsMatch and IsMatch2 are good to validate user input. The result of these methods is true if the whole string matches the pattern.
For example the following code checks if a date string is formatted in a format similar to: 12-31-2010
B4X:
    Log(Regex.IsMatch("\d\d-\d\d-\d\d\d\d", "11-15-2010")) 'True
    Log(Regex.IsMatch("\d\d-\d\d-\d\d\d\d", "12\31\2010")) 'False
This pattern will also match the string "99-99-9999".
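If such out-of-range values should be rejected by the pattern itself, one possible approach is to restrict the digit ranges. The pattern below is an illustrative assumption, not part of the original example, and it still does not check the number of days in each month:
B4X:
Dim datePattern As String
datePattern = "(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])-\d{4}"
Log(Regex.IsMatch(datePattern, "12-31-2010")) 'True
Log(Regex.IsMatch(datePattern, "99-99-9999")) 'False
Log(Regex.IsMatch(datePattern, "02-31-2010")) 'True, although February has no 31st day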

Splitting text
Split and Split2 split text around matches of the given pattern.
Simple case:
B4X:
Dim data As String
data = "123,432,13,4,12,534"
Dim numbers() As String
numbers = Regex.Split(",", data)
Dim l As List
l.Initialize2(numbers)
Log(l)
Lists can be easily printed with Log so we add the array to a list.
The result is:
[123, 432, 13, 4, 12, 534]

The comma followed by a single space is part of the list formatting. The expected values were parsed.

Now if the data value were "123, 432 , 13 , 4 , 12, 534", the result would not be perfect:
[123,  432 ,  13 ,  4 ,  12,  534]

There are extra spaces which are part of the parsed values.

We can change the pattern to match a comma or white space:
B4X:
numbers = Regex.Split("[,\s]", data)
The result is still not what we want:
[123, , 432, , , 13, , , 4, , , 12, , 534]

Many empty strings were added.
The correct pattern in this case is:
B4X:
numbers = Regex.Split("[,\s]+", data)
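Putting it together, a minimal sketch of the complete split with this pattern, using the sample data with the extra spaces:
B4X:
Dim data As String
data = "123, 432 , 13 , 4 , 12, 534"
Dim numbers() As String
numbers = Regex.Split("[,\s]+", data)
For i = 0 To numbers.Length - 1
    Log(numbers(i)) 'logs 123, 432, 13, 4, 12 and 534, one value per line
Next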

Find matches in string
Here we have a long string and we want to find all matches of a pattern in the string. We can also use capture groups to get specific parts of the match.

As an example we will find and print email addresses in text:
B4X:
Dim data As String
data = "Please contact mike@gmail.com or john@gmail.com"
Dim matcher1 As Matcher
matcher1 = Regex.Matcher("\w+@\w+\.\w+", data)
Do While matcher1.Find = True
    Log(matcher1.Match)
Loop
This code prints:
mike@gmail.com
john@gmail.com

Note that this pattern is far from being a good pattern for email validation / matching.

In the second example we will use a Matcher with capturing groups to validate a date text. The pattern is similar to the pattern in the first example with the addition of parentheses. These parentheses mark the groups:
B4X:
Log(IsValidDate("13-31-1212")) 'false
Log(IsValidDate("12-31-1212")) 'true

Sub IsValidDate(Date As String) As Boolean
    Dim matcher1 As Matcher
    matcher1 = Regex.Matcher("(\d\d)-(\d\d)-(\d\d\d\d)", Date)
    If matcher1.Find = True Then
        Dim days, months As Int
        months = matcher1.Group(1) 'fetch the first captured group.
        days = matcher1.Group(2) 'fetch the second captured group
        If months > 12 Then Return False
        If days > 31 Then Return False
        Return True
    Else
        Return False
    End If
End Sub
The groups feature is very useful. If you find yourself calling String.IndexOf together with String.Substring multiple times, it is a good hint that you should move to a Regex and Matcher.
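As a small illustration of that hint, the sketch below pulls a name and a value out of a "name=value" string with two groups instead of IndexOf and Substring. The input string and the variable name are hypothetical:
B4X:
Dim m As Matcher
m = Regex.Matcher("(\w+)=(\w+)", "width=100")
If m.Find = True Then
    Log(m.Group(1)) 'width
    Log(m.Group(2)) '100
End If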

Online tool to test Regex patterns: http://www.basic4ppc.com/android/forum/threads/server-regex-tool.39192/
 

WZSun

Member
Licensed User
Hi Erel,
Thanks for the insight... it sure did help inspire me to think harder..


Below is a quick StrParse sub that I did to retrieve part of a sample date:

s = "12/31/2010"
s1 = StrParse(s,"/",2)
msgbox(s1,"Info") ' returns 2010

Sub StrParse(FirstStr As String, sSeparator As String, idx As Int) As String
    Dim strArray() As String, l As List
    strArray = Regex.Split("[" & sSeparator & "\s]", FirstStr)
    l.Initialize2(strArray)
    Return l.Get(idx)
End Sub

Rgds
WZSun
 

Foz

Member
Licensed User
I think I'm missing something here...

If I do a Split, into a dynamic string array, how do I then get the resulting array size?
 

Erel

Administrator
Staff member
Licensed User
B4X:
Dim arr() As String
arr = Regex.Split(...)
For i = 0 To arr.Length - 1
 Log(arr(i))
Next
 

Foz

Member
Licensed User
Thank you Erel!

sigh... I was doing an inline assign which you obviously can't do, and it didn't like it and thus never showed the Length field and wouldn't compile.

One of these days I'll get my head screwed on straight...
 

ChrShe

Member
Licensed User
Some quick Regex.Matcher help...

Good day!

I've been tinkering around with the Regex.Matcher and have run into a bit of a snag that I was hoping I could get some help with.

I'm parsing a web page with the following lines:

<div class="list-animal-info-block">
<div class="list-animal-name"><a href="wsAdoptableAnimalDetails.aspx?id=13069119&css=adoptableSearch.css" >Jed</a></div>
<div class="list-animal-id">13069119</div>
<div class="list-anima-species">Dog</div>
<div class="list-animal-sexSN">Male/Neutered</div>
<div class="list-animal-breed">Terrier, American Pit Bull/Mix</div>
<div class="list-animal-age">2 years 9 months</div>
<div class="hidden">Dog Large</div>
What I need to get is the InnerText of each div line. So, for example, for the list-animal-id line, I'd like to have "13069119" returned.

Using the following, I've been able to get the matcher to find the line, but can't seem to figure out returning the portion of the line that I'm interested in.
B4X:
 Regex.Matcher("class=\""list-animal-name\""",page)
So, basically, how do I get the matcher to return the portion of the found line that I want?

Any help is greatly appreciated.
THANK YOU!!!
~Chris
 

Erel

Administrator
Staff member
Licensed User
It is better to start a new thread for such questions.

If the string is a valid XML (XHTML) then you can use an XML parser to parse it.

With Regex you need something like:
B4X:
"class=\""([^""]+)\"">([^>]+)</div>" 'group 1 will hold the class attribute and group 2 the text.
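A minimal sketch of how such a pattern could be used with a Matcher, assuming the page text is already in a string. The variable names are arbitrary and only two of the div lines are included here:
B4X:
Dim page As String
page = "<div class=""list-animal-id"">13069119</div>" & CRLF & _
    "<div class=""list-anima-species"">Dog</div>"
Dim m As Matcher
m = Regex.Matcher("class=\""([^""]+)\"">([^>]+)</div>", page)
Do While m.Find = True
    'Group(1) is the class attribute, Group(2) is the text inside the div.
    Log(m.Group(1) & ": " & m.Group(2))
Loop
'Logs:
'list-animal-id: 13069119
'list-anima-species: Dog
Note that this only covers divs that contain plain text. The list-animal-name line wraps its text in an anchor tag, so it would not match this pattern.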
 

LucaMs

Expert
Licensed User
Regular expressions are very powerful and make complicate parsing challenges much easier...

The groups feature is very useful. If you find yourself calling String.IndexOf together with String.Substring multiple times, it is a good hint that you should move to a Regex and Matcher.

I found :)

When I met regular expressions, I quickly abandoned them.
I thought: "Too much time to learn them, I hurry faster with string functions."

This last sentence of yours makes me think, though.

Am I wrong, or could they be very useful for creating a command line parser and for breaking up (split, group or grrrr) HTML blocks/tags?
 

Alberto Michelis

Active Member
Licensed User
How to check only alphabetical chars and spaces?
Alberto Michelis OK
Alberto,Michelis Wrong
Alberto2Michelis Wrong
Thanks
 

victormedranop

Well-Known Member
Licensed User
I need to parse this string "Resultado : Q;11#1;P;12#1;T;13#23;Q;14#2;Q;21#2;P;22#2;T;23#3;Q;31#3;P;32#3;T;33#9;SP;34#10;SP;35#6;Q;41#6;P;42#6;T;43#12;SP;44#11;SP;45#7;Q;51#7;P;52#7;T;53#13;SP;54#14;SP;55#20;Q;61#20;P;62#20;T;63#21;Q;71#21;P;72#21;T;73#22;C;81"

the result should be

Q;11#1;
P;12#1;
T;13#23;
Q;14#2;
Q;21#2;
P;22#2;
T;23#3;
Q;31#3;
P;32#3;
T;33#9;
SP;34#10;
SP;35#6;
Q;41#6;
P;42#6;
T;43#12;
SP;44#11;
SP;45#7;
Q;51#7;
P;52#7;
T;53#13;
SP;54#14;
SP;55#20;
Q;61#20;
P;62#20;
T;63#21;
Q;71#21;
P;72#21;
T;73#22;
C;81


any help will be appreciated.

Victor
 

MaFu

Well-Known Member
Licensed User
This regex pattern should work:
(([A-Z]+;\d+#\d+;)|([A-Z]+;\d+))
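A possible way to apply it, sketched with a shortened version of the input string (the variable name is arbitrary):
B4X:
Dim m As Matcher
m = Regex.Matcher("(([A-Z]+;\d+#\d+;)|([A-Z]+;\d+))", "Resultado : Q;11#1;P;12#1;SP;34#10;C;81")
Do While m.Find = True
    Log(m.Match) 'Q;11#1; then P;12#1; then SP;34#10; then C;81
Loop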
 

Erel

Administrator
Staff member
Licensed User
This is not the correct place to post such questions. Always start a new thread for your question.
 