Android Question Regex help

Robert Valentino

Well-Known Member
Licensed User
I am trying to parse people's names (sometimes just a first name, sometimes last name comma first name)

I tried this Regex expression, which works on an online website but doesn't work in B4A

upload_2019-9-19_20-48-52.png



At 68 I am just so lame when it comes to Regex
 

emexes

Well-Known Member
Licensed User
That regex will grab a word (or words) at the beginning of a line. Can't tell whether it is specifically a name. Will work in B4X no problem; difficult to say why it is not working for you without seeing your code.

Better yet, post some sample text that you will be extracting from, showing the various forms that names will take, and ideally some that should *not* match too.

Also, what to do about names that are:
- hyphenated (Robert Baden-Powell)
- spaced (Daniel De Silva) (cf LeBron James and Charles-François Lebrun)
- internally-capitalised (Maggie McIntosh)
- contracted (Rosie O'Donnell)
 

emexes

Well-Known Member
Licensed User
emexes said: "Also, what to do about names that are:"
I once had a student named Li Ja Ny (or something like that) and I never did work out whether "Ja" was part of her first name or last name.

Similar to Lisa Marie Presley, where "Marie" seems to be part of the first name, rather than a middle name.
 

Robert Valentino

Well-Known Member
Licensed User
I am trying to parse text that looks like this

upload_2019-9-19_21-27-9.png


All I really need is the name

I'm on the road celebrating my 43rd anniversary and working on a laptop that is almost as old (only kidding). Will post some code when I get home.

Just figured for you whiz kids this would be a no-brainer - this coming from someone who is slowly becoming a no-brainer

Oh... I should mention that much as the data looks like it is fixed column, it isn't. I am reading and parsing a PDF and the data is of varying length. It would be nice if it came in as fixed column; that would make parsing so much easier
 

drgottjr

Well-Known Member
Licensed User
looks like tab-delimited input to me. split on \t or chr(9), no? then take field 1 (the name field). are you familiar with B4A's Regex.Split? search for regex.split here in the forum
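
A minimal sketch of that idea, assuming the PDF text really does contain tabs (not confirmed anywhere in this thread). The sample row, its leading number, and the names Row and Fields are made up for illustration:

```b4x
'Hypothetical sample row: number, name and phone separated by tabs
Dim Row As String = "12" & TAB & "Allen, Tye" & TAB & "(301)628-8687"

'B4X strings do not process escapes, so "\t" reaches the regex engine as-is
Dim Fields() As String = Regex.Split("\t", Row)

If Fields.Length >= 2 Then
    Log(Fields(1).Trim)    'field 0 is the number, field 1 is the name
Else
    Log("no tabs found in: [" & Row & "]")    'fall back to another approach
End If
```

If the Else branch fires for real data, the text is not tab-delimited and one of the regex approaches is the next thing to try.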
 

emexes

Well-Known Member
Licensed User
Looking at that data, and assuming that for some reason you cannot extract the name column by width, I'd be splitting it into fields by observing that:

- the first column is a number with at least one digit
- the second column is name comma name
- the third column always starts with a ( followed by a digit or an underscore

and so a regex like:

^[ ]*[0-9]+[ ]+([A-Z][^,]+),[ ]{0,2}([A-Z].*?)[ ]+\([0-9_]

should work. Not tested, but:

^ = start of line
[ ]* = optional leading spaces
[0-9]+ = one or more digits
[ ]+ = separation space(s)
( = capture surname
[A-Z] = begins with capital letter
[^,]+ = then everything (one or more characters) up to the comma
) = end of surname capture
, = the comma between last and first name
[ ]{0,2} = optional? multiple? spaces after comma
( = capture first name
[A-Z] = begins with capital letter
.*? = then everything else up until spaces--bracket-digit-or-underscore (non-greedy, so as to NOT grab the trailing spaces)
) = end of first name capture
[ ]+ = one or more spaces
\( = followed by a bracket
[0-9_] = and then a digit or an underscore

what could possibly go wrong? I guess we'll find out when you log the lines that DON'T match ;-)
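
A rough sketch of how the pattern above could be tried one line at a time, logging the lines that don't match. The sample rows (including the leading number 12) and the variable names are assumptions for illustration, not data from the PDF:

```b4x
Dim Pattern As String = "^[ ]*[0-9]+[ ]+([A-Z][^,]+),[ ]{0,2}([A-Z].*?)[ ]+\([0-9_]"

'Hypothetical rows: one data row and one header row
Dim Rows() As String = Array As String("  12  Allen, Tye   (301)628-8687", "some header stuff")

For Each Row As String In Rows
    Dim m As Matcher = Regex.Matcher(Pattern, Row)
    If m.Find Then
        'Group(1) = surname, Group(2) = first name
        Log("last=[" & m.Group(1) & "] first=[" & m.Group(2) & "]")
    Else
        'square brackets make leading/trailing spaces visible
        Log("NO MATCH: [" & Row & "]")
    End If
Next
```

The NO MATCH log is the useful part: it shows which real rows the pattern's assumptions do not cover.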
 

emexes

Well-Known Member
Licensed User
Robert Valentino said: "I should mention that much as the data looks like it is fixed column it isn't. I am reading and parsing a PDF"
drgottjr said: "looks like tab-delimited input to me"
This made me smile - the optimism of youth!

drgottjr said: "split on \t or chr(9), no?"
Probably correct: no is the likely answer. Having said that, it's worth a go - you never know your luck, perhaps the PDF-to-text transform is adding tabs.

If not, then what might work is to split on [ ]\([0-9_]{3,3}\) which would split on the telephone area code (and the preceding space). The first field of that split would be number-lastname-comma-firstname, and if you took everything after the comma, that should leave you with just the first name (and some spaces that you can easily .Trim).
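
A small sketch of that split-on-area-code idea, assuming every row carries a three-character area code in brackets. The sample row and variable names are made up for illustration:

```b4x
'Hypothetical row: number, "last, first", phone
Dim Row As String = "12  Allen, Tye   (301)628-8687"

'Split on one space followed by "(ddd)" or "(___)"
Dim Parts() As String = Regex.Split("[ ]\([0-9_]{3,3}\)", Row)

'Parts(0) is number-lastname-comma-firstname; keep what follows the comma
Dim Head As String = Parts(0)
Dim C As Int = Head.IndexOf(",")
If C >= 0 Then
    Log(Head.SubString(C + 1).Trim)    'first name only
Else
    Log("no comma in: [" & Head & "]")
End If
```

Rows without a phone number would come back from the split unchanged, so the comma step still applies to them, but anything after the name would stay attached.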
 

sorex

Expert
Licensed User
this regex works fine

\d+ (.*?) \(\d+

B4X:
Dim txt As String=$"some header stuff

128 sdffd sdfdsfdf (4545)
34       erere      (455)
64${TAB}adada${TAB}(535-5335-3553)
"$

txt=txt.Replace(TAB," ")

'Dim m As Matcher = Regex.Matcher("\d+ (.*?) \(\d+", txt)
Dim m As Matcher = Regex.Matcher("\d+(.*?)\(", txt)  'fix for (___ tel. numbers as mentioned below
Do While m.Find
 Log(m.Group(1).Trim)
Loop
Waiting for debugger to connect...
Program started.
sdffd sdfdsfdf
erere
adada
 

emexes

Well-Known Member
Licensed User
sorex said: "this regex works fine"
if you don't mind losing the entries that have no phone number ;-)

Having said that, if the entire data is presented as one string, rather than a line at a time, then I have my own little oversight: the ^ start of line should really be \n, and so my final (well: current final) pattern is:
B4X:
Dim m As Matcher = Regex.Matcher("\n[ ]*[0-9]+[ ]+([A-Z][^,]+),[ ]{0,2}([A-Z].*?)(?:[ ]+(?:[A-Z]\.)+)*[ ]+\([0-9_]", txt)
which has the bonus features of:
- not capturing middle name initials
- presenting the names as two fields (last name and first name) so that you can use them in a mail merge, eg "Dear Earl", rather than "Dear Bedney, Earl R.,"
- not capturing spaces adjoining names (although .Trim fixes that easy enough)
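
As an alternative to swapping ^ for \n (not what the post above does), B4X's Regex.Matcher2 accepts Regex.MULTILINE, which lets ^ match at the start of every line, including the very first one. A sketch under that assumption, with made-up sample text:

```b4x
'Hypothetical multi-line input; CRLF is Chr(10) in B4X
Dim txt As String = "some header stuff" & CRLF & "12  Allen, Tye  (301)628-8687" & CRLF & "13  Bedney, Earl R.  (___)___-____"

'Same pattern as above, but anchored with ^ and compiled with MULTILINE
Dim m As Matcher = Regex.Matcher2("^[ ]*[0-9]+[ ]+([A-Z][^,]+),[ ]{0,2}([A-Z].*?)(?:[ ]+(?:[A-Z]\.)+)*[ ]+\([0-9_]", Regex.MULTILINE, txt)

Do While m.Find
    'Group(2) = first name, Group(1) = last name, middle initials not captured
    Log(m.Group(2) & " " & m.Group(1))
Loop
```

One difference from the \n version: a data row on the very first line of the text would still be found, since it needs no preceding newline.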
 

sorex

Expert
Licensed User
you mean that first number in the line?

that seems like an id that is always there.

but indeed it's guessing when you don't have the actual output to play with.


Edit: ah you mean those (___ ones?

this captures it fine

B4X:
 Regex.Matcher("\d+(.*?)\(", txt)
 

Robert Valentino

Well-Known Member
Licensed User
I don't know, I must be doing something wrong.

B4X:
Dim Matcher1 As Matcher = Regex.Matcher("[ ]*[0-9]+[ ]+([A-Z][^,]+),[ ]{0,2}([A-Z].*?)[ ]+\([0-9_]", Player.Name)

Log("Try 1 - Player.Name:" & Player.Name)
Do While Matcher1.Find
    Log(Matcher1.Group(1).Trim)
Loop

Matcher1 = Regex.Matcher("\d+(.*?)\(", Player.Name)

Log("Try 2 - Player.Name:" & Player.Name)
Do While Matcher1.Find
    Log(Matcher1.Group(1).Trim)
Loop
Try 1 - Player.Name:Allen, Tye (301)628-8687
Try 2 - Player.Name:Allen, Tye (301)628-8687

Neither of these Regex statements produces any results
 

sorex

Expert
Licensed User
you've cut off the id in front of the name

try

B4X:
Matcher1 = Regex.Matcher("(.*?)\(", Player.Name)
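
Why both tries logged nothing: each pattern expects the id digits before the name, but in "Allen, Tye (301)628-8687" the only digits are in the phone number, after the only "(". A short sketch of the pattern without the id part, using the logged value as sample input (the variable name Sample is just for illustration):

```b4x
Dim Sample As String = "Allen, Tye (301)628-8687"

'Lazily capture everything up to the first "("
Dim m As Matcher = Regex.Matcher("(.*?)\(", Sample)
If m.Find Then
    Log("[" & m.Group(1).Trim & "]")    'name with the separator space trimmed off
Else
    Log("no ( found in: [" & Sample & "]")
End If
```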
 

emexes

Well-Known Member
Licensed User
If Player.Name is the name and telephone number of a single player (quantitatively, not maritally ;-) then perhaps it might be simpler just to delete the phone number, ie everything from the "(" onwards, eg:
B4X:
Dim JustTheName As String = Player.Name.Trim
Dim P As Int = JustTheName.IndexOf("(")
If P >= 0 Then
    JustTheName = JustTheName.SubString2(0, P).Trim    'keep name / delete telephone number
End If
or with regex:
B4X:
Log( Regex.Split("\(", Player.Name)(0).Trim )
where the .Split returns an array of strings, the (0) gets you the first of those strings, and the .Trim removes the column-separator spaces.
 

emexes

Well-Known Member
Licensed User
Longer step-by-step version of regex:
B4X:
Dim S() As String = Regex.Split("\(", Player.Name)    'split string using "(" instead of the more-usual ","

Log("[" & Player.Name & "]")    'square brackets so can "see" leading/trailing spaces too
For I = 0 To S.Length - 1
    Log(I & " = [" & S(I) & "]")  
Next

Dim JustTheName As String = S(0).Trim    'name will be first field, ie everything up to but not including separator "("
Log(JustTheName)
 

sorex

Expert
Licensed User
yeah, in his last case it can be a one-liner but he's changing the rules/data each time :)

maybe he's testing, and his final data will be the one from the screenshot again.
 