Android Question Regex help

Robert Valentino · Sep 20, 2019

I am trying to parse people's names (sometimes just first name, sometimes last name comma first name)

I tried this Regex expression, which works on an online website but doesn't work in B4A

At 68 I am just so lame when it comes to Regex

emexes · Sep 20, 2019

That regex will grab a word (or words) at the beginning of a line. Can't tell whether it is specifically a name. Will work in B4X no problem; difficult to say why it is not working for you without seeing your code.

Better yet, post some sample text that you will be extracting from, showing the various forms that names will take, and ideally some that should *not* match too.

Also, what to do about names that are:
- hyphenated (Robert Baden-Powell)
- spaced (Daniel De Silva) (cf LeBron James and Charles-François Lebrun)
- internally-capitalised (Maggie McIntosh)
- contracted (Rosie O'Donnell)

emexes · Sep 20, 2019

emexes said:
Also, what to do about names that are:

I once had a student named Li Ja Ny (or something like that) and I never did work out whether "Ja" was part of her first name or last name.

Similar to Lisa Marie Presley, where "Marie" seems to be part of the first name, rather than a middle name.

Robert Valentino · Sep 20, 2019

I am trying to parse text that looks like this

All I really need is the name

I'm on the road celebrating my 43rd anniversary and working on a laptop that is almost as old (only kidding) will post some code when I get home.

Just figured for you wiz kids this would be a no brain'er - this coming from someone who is slowly becoming a no brain'er

Oh... I should mention that much as the data looks like it is fixed column it isn't. I am reading and parsing a PDF and the data is varying length. It would be nice if it came in as fixed column; that would make parsing so much easier

drgottjr · Sep 20, 2019

looks like tab-delimited input to me. split on \t or chr(9), no?. then take field 1 (the name field). are you familiar with B4A's Regex.split? search for regex.split here in the forum
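A minimal sketch of that split, assuming the text really does contain tabs. The sample line and its leading id are made up for illustration:

B4X:

'Hypothetical sample line; real lines would come from the PDF text
Dim Line As String = "12" & TAB & "Allen, Tye" & TAB & "(301)628-8687"
Dim Fields() As String = Regex.Split("\t", Line)
If Fields.Length > 1 Then
    Log(Fields(1).Trim)    'the name field
Else
    Log("No tabs found in: [" & Line & "]")
End If

If the Else branch fires for every line, the PDF-to-text step is not producing tabs and a different separator is needed.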

emexes · Sep 20, 2019

Looking at that data, and assuming that for some reason you cannot extract the name column by width, I'd be splitting it into fields by observing that:

- the first column is a number with at least one digit
- the second column is name comma name
- the third column always starts with a ( followed by a digit or an underscore

and so a regex like:

^[ ]*[0-9]+[ ]+([A-Z][^,]+),[ ]{0,2}([A-Z].*?)[ ]+\([0-9_]

should work. Not tested, but:

^ = start of line
[ ]* = optional leading spaces
[0-9]+ = one or more digits
[ ]+ = separation space(s)
( = capture surname
[A-Z] = begins with capital letter
[^,]+ = then everything up to the comma (at least one character)
) = end of surname capture
, = the comma between last and first name
[ ]{0,2} = optional? multiple? spaces after comma
( = capture first name
[A-Z] = begins with capital letter
.*? = then everything else up until spaces--bracket-digit-or-underscore (non-greedy, so as to NOT grab the trailing spaces)
) = end of first name capture
[ ]+ = one or more spaces
\( = followed by a bracket
[0-9_] = and then a digit or an underscore

what could possibly go wrong? I guess we'll find out when you log the lines that DON'T match ;-)
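A rough sketch of how that pattern could be tried on one line at a time. The sample line is hypothetical (id and phone number are invented), and the Else branch is there to log the lines that don't match:

B4X:

'Hypothetical sample line, one record per string
Dim Line As String = "12  Bedney, Earl R.   (301)555-0100"
Dim Pattern As String = "^[ ]*[0-9]+[ ]+([A-Z][^,]+),[ ]{0,2}([A-Z].*?)[ ]+\([0-9_]"
Dim m As Matcher = Regex.Matcher(Pattern, Line)
If m.Find Then
    Log("Last name:  [" & m.Group(1) & "]")    'Bedney
    Log("First name: [" & m.Group(2) & "]")    'Earl R.
Else
    Log("No match: [" & Line & "]")
End If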

emexes · Sep 20, 2019

Robert Valentino said:
I should mention that much as the data looks like it is fixed column it isn't. I am reading and parsing a PDF

drgottjr said:
looks like tab-delimited input to me

This made me smile - the optimism of youth!

drgottjr said:
split on \t or chr(9), no?.

Probably correct: no is the likely answer. Having said that, it's worth a go - you never know your luck, perhaps the PDF-to-text transform is adding tabs.

If not, then what might work is to split on [ ]\([0-9_]{3,3}\) which would split on the telephone area code (and the preceding space). The first field of that split would be number-lastname-comma-firstname, and if you took everything after the comma, that should leave you with just the first name (and some spaces that you can easily .Trim).
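Sketched out, and again using an invented sample line, the split-on-area-code idea might look like this:

B4X:

'Hypothetical sample line
Dim Line As String = "12  Bedney, Earl R.   (301)555-0100"
Dim Fields() As String = Regex.Split("[ ]\([0-9_]{3,3}\)", Line)
Dim Head As String = Fields(0)    'number, last name, comma, first name
Dim C As Int = Head.IndexOf(",")
If C >= 0 Then
    Log(Head.SubString(C + 1).Trim)    'first name: Earl R.
Else
    Log("No comma in: [" & Head & "]")
End If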

sorex · Sep 20, 2019

this regex works fine

\d+ (.*?) \(\d+

B4X:

Dim txt As String=$"some header stuff

128 sdffd sdfdsfdf (4545)
34       erere      (455)
64${TAB}adada${TAB}(535-5335-3553)
"$

txt=txt.Replace(TAB," ")

'Dim m As Matcher = Regex.Matcher("\d+ (.*?) \(\d+", txt)
Dim m As Matcher = Regex.Matcher("\d+(.*?)\(", txt)  'fix for (___ tel. numbers as mentioned below
Do While m.Find
    Log(m.Group(1).Trim)
Loop

Waiting for debugger to connect...
Program started.
sdffd sdfdsfdf
erere
adada

emexes · Sep 20, 2019

sorex said:
this regex works fine

if you don't mind losing the entries that have no phone number ;-)

Having said that, if the entire data is presented as one string, rather than a line at a time, then I have my own little oversight: the ^ start of line should really be \n, and so my final (well: current final) pattern is:

B4X:

Dim m As Matcher = Regex.Matcher("\n[ ]*[0-9]+[ ]+([A-Z][^,]+),[ ]{0,2}([A-Z].*?)(?:[ ]+(?:[A-Z]\.)+)*[ ]+\([0-9_]", txt)

which has the bonus features of:
- not capturing middle name initials
- presenting the names as two fields (last name and first name) so that you can use them in a mail merge, eg "Dear Earl" rather than "Dear Bedney, Earl R.,"
- not capturing spaces adjoining names (although .Trim fixes that easy enough)
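One caveat with swapping ^ for \n: a record on the very first line of the string has no newline in front of it, so it would be skipped. A possible alternative (an assumption, not tested against the real PDF text) is to keep ^ and pass Regex.MULTILINE to Regex.Matcher2, so that ^ matches at the start of every line. The two sample lines are made up from names and number formats mentioned in this thread:

B4X:

'Hypothetical two-line sample; CRLF is the B4X line feed constant, Chr(10)
Dim txt As String = "1  Allen, Tye   (301)628-8687" & CRLF & "2  Bedney, Earl R.   (___)___-____"
Dim Pattern As String = "^[ ]*[0-9]+[ ]+([A-Z][^,]+),[ ]{0,2}([A-Z].*?)(?:[ ]+(?:[A-Z]\.)+)*[ ]+\([0-9_]"
Dim m As Matcher = Regex.Matcher2(Pattern, Regex.MULTILINE, txt)
Do While m.Find
    Log(m.Group(2) & " " & m.Group(1))    'first name, then last name
Loop

With these two lines the log should show "Tye Allen" and "Earl Bedney", with the middle initial left out of the second one.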

sorex · Sep 20, 2019

you mean that first number in the line?

that seems like an id that is always there.

but indeed it's guessing when you don't have the actual output to play with.

Edit: ah you mean those (___ ones?

this captures it fine

B4X:

 Regex.Matcher("\d+(.*?)\(", txt)

Robert Valentino · Sep 20, 2019

I don't know, I must be doing something wrong.

B4X:

Dim Matcher1 As Matcher = Regex.Matcher("[ ]*[0-9]+[ ]+([A-Z][^,]+),[ ]{0,2}([A-Z].*?)[ ]+\([0-9_]", Player.Name)

Log("Try 1 - Player.Name:" & Player.Name)
Do While Matcher1.Find
    Log(Matcher1.Group(1).Trim)
Loop

Matcher1 = Regex.Matcher("\d+(.*?)\(", Player.Name)

Log("Try 2 - Player.Name:" & Player.Name)
Do While Matcher1.Find
    Log(Matcher1.Group(1).Trim)
Loop

Try 1 - Player.Name:Allen, Tye (301)628-8687
Try 2 - Player.Name:Allen, Tye (301)628-8687

Neither of these Regex statements produces any results

sorex · Sep 20, 2019

you've cut off the id in front of the name

try

B4X:

Matcher1 = Regex.Matcher("(.*?)\(", Player.Name)
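Both earlier patterns require digits before the name, and the logged Player.Name starts directly with "Allen", so neither could match. A quick check of the shorter pattern against the logged value, with an Else branch so a non-match shows up in the log:

B4X:

Dim Name As String = "Allen, Tye (301)628-8687"    'the value from the log output
Dim m As Matcher = Regex.Matcher("(.*?)\(", Name)
If m.Find Then
    Log(m.Group(1).Trim)    'Allen, Tye
Else
    Log("No match: [" & Name & "]")
End If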

Robert Valentino · Sep 20, 2019

YES, you are so right. Thanks so much

emexes · Sep 21, 2019

If Player.Name is the name and telephone number of a single player (quantitatively, not maritally ;-) then perhaps it might be simpler just to delete the phone number ie everything from the "(" onwards, eg:

B4X:

Dim JustTheName As String = Player.Name.Trim
Dim P As Int = JustTheName.IndexOf("(")
If P >= 0 Then
    JustTheName = JustTheName.SubString2(0, P).Trim    'keep name / delete telephone number
End If

or with regex:

B4X:

Log( Regex.Split("\(", Player.Name)(0).Trim )

where the .Split returns an array of strings, the (0) gets you the first of those strings, and the .Trim removes the column-separator spaces.

emexes · Sep 21, 2019

Longer step-by-step version of regex:

B4X:

Dim S() As String = Regex.Split("\(", Player.Name)    'split string using "(" instead of the more-usual ","

Log("[" & Player.Name & "]")    'square brackets so can "see" leading/trailing spaces too
For I = 0 To S.Length - 1
    Log(I & " = [" & S(I) & "]")  
Next

Dim JustTheName As String = S(0).Trim    'name will be first field, ie everything up to but not including separator "("
Log(JustTheName)

sorex · Sep 21, 2019

yeah, in his last case it can be a one-liner but he's changing the rules/data each time

maybe he's testing and his final data will be the one from the screenshot again.
