I Hate Regex, Part 2

Heuristx

When people say they hate RegEx, they usually do not mean RegEx itself; they hate its syntax and morphology.
That is why it is such a controversial topic.

The functionality of RegEx is useful, although internally it is quite messy: its rules allow some ambiguity and self-contradiction, which the engine tries to resolve with "backtracking". Just look at how many Web sites have RegEx gurus helping, tutoring and explaining these ambiguities, and at how many different ways the RegEx example sites define the same thing.

Still, it is the standard (which no implementation follows exactly), so it is worth having a tool that makes it faster and safer to use.

If you doubt the "faster and safer" part, pull out a stopwatch and see how long it takes you to find the one character I changed in his RegEx and debug it:
A RegEx example:
^((?>[a-zA-Z\d!#$%&'*+\-/=?^_`{|}~]+\x20*|"((?=[\x01-\x7f])[^"\\]|\\[\x01-\x7f])*"\x20*)*(?<angle><))?((?!\.)(?>\.?[a-zA-Z\d!#$%&'*+\-/=?^_`{|}~]+)+|"((?=[\x01-\x7f])[^"\\]|\\[\x01-\x7f])*")@(((?!-)[a-zA-Z\d\-]+(?<!-)\.)+[a-zA-Z]{2,}|\[(((?(?<!\[)\.)(25[0-5]|2[0-4]\d|[01]?\d?\d)){4}|[a-zA-Z\d\-]*[a-zA-Z\d]:((?=[\x01-\x7f])[^\\\[\]]|\\[\x01-\x7f])+)\])(?(angle)>)$

(And it is not even compatible with the Java Regex engine.)
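
One construct in the pattern above that java.util.regex has no syntax for is the conditional, as in `(?(angle)>)`. The sketch below uses a reduced pattern (not the full one above) to show the Java engine rejecting a conditional, plus one possible rewrite with plain alternation. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class ConditionalCheck {
    /** Returns true if java.util.regex accepts the pattern, false if it rejects it. */
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Reduced form of the optional "<...>" wrapper: close the bracket only if it was opened.
        String withConditional = "^(?<angle><)?\\w+(?(angle)>)$";
        // The same idea without a conditional, spelled out as two alternatives.
        String withAlternation = "^(?:<\\w+>|\\w+)$";

        System.out.println(compiles(withConditional));                  // false
        System.out.println(compiles(withAlternation));                  // true
        System.out.println(Pattern.matches(withAlternation, "<user>")); // true
        System.out.println(Pattern.matches(withAlternation, "<user"));  // false
    }
}
```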

One problem with this notation is that it is horizontal.
A programmer is used to breaking up a task into units that are arranged vertically. While a RegEx looks like plain text, in reality it is full of loops and functions, and the horizontal arrangement fools the eye.
I will not even talk about the fact that a counter {1, 4} will crash because of the space after the comma...
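
The counter problem is easy to reproduce on a desktop JVM. This is a small check in plain Java, not part of the class; the helper name is hypothetical.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class CounterSpace {
    /** Returns "ok" if the pattern compiles, otherwise the engine's error description. */
    static String compileResult(String regex) {
        try {
            Pattern.compile(regex);
            return "ok";
        } catch (PatternSyntaxException e) {
            return "error: " + e.getDescription();
        }
    }

    public static void main(String[] args) {
        System.out.println(compileResult("a{1,4}"));   // ok
        System.out.println(compileResult("a{1, 4}"));  // error: ... (space after the comma)
    }
}
```

Other engines may treat the second form differently, for example as literal text, which is arguably worse because nothing is reported.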

Anyway, I hate the definition language but want to benefit from the functionality, so here is a B4X class for that.
I know that there are Verbal Expressions and the B4X RegexBuilder, but they are not hierarchical. They are linear, like a StringBuilder (which is at their core). While strings are linear by definition (they ARE called strings!), RegEx is not.

So this class uses a stack and a tree structure for its elements. Because of this, it can decide when to enclose elements in the different kinds of brackets ([...], {...} or (...)).
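
To illustrate the idea (not the actual implementation in the attached class), here is a minimal Java sketch of a builder that keeps a stack of open nodes and a tree of elements, and only adds a group when a quantifier would otherwise bind to the wrong thing. All names are hypothetical, and the "is this a single atom" test is a deliberately crude heuristic.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class TreeRegexSketch {
    /** One node of the pattern tree: either a leaf with text, or a quantified container. */
    static final class Node {
        final String literal;     // non-null for leaves (text is assumed to be escaped already)
        final String quantifier;  // suffix such as "+" or "{2,3}", or "" for none
        final List<Node> children = new ArrayList<>();

        Node(String literal, String quantifier) {
            this.literal = literal;
            this.quantifier = quantifier;
        }

        /** Crude heuristic: one character, one escape like \d, or one [...] class. */
        boolean isAtom() {
            if (literal == null) return false;
            return literal.length() == 1
                || (literal.length() == 2 && literal.charAt(0) == '\\')
                || (literal.startsWith("[") && literal.endsWith("]"));
        }

        String render() {
            if (literal != null) return literal;
            StringBuilder sb = new StringBuilder();
            for (Node child : children) sb.append(child.render());
            String body = sb.toString();
            if (quantifier.isEmpty()) return body;
            // A quantifier binds to a single atom, so anything longer gets a group.
            boolean singleAtom = children.size() == 1 && children.get(0).isAtom();
            return (singleAtom ? body : "(?:" + body + ")") + quantifier;
        }
    }

    private final Node root = new Node(null, "");
    private final Deque<Node> open = new ArrayDeque<>();

    TreeRegexSketch() { open.push(root); }

    TreeRegexSketch add(String piece) {
        open.peek().children.add(new Node(piece, ""));
        return this;
    }

    TreeRegexSketch beginQty(String suffix) {
        Node node = new Node(null, suffix);
        open.peek().children.add(node);
        open.push(node);  // later add() calls go inside this node
        return this;
    }

    TreeRegexSketch endQty() {
        if (open.size() == 1) throw new IllegalStateException("endQty without beginQty");
        open.pop();
        return this;
    }

    String build() { return root.render(); }

    public static void main(String[] args) {
        String p = new TreeRegexSketch()
            .beginQty("+").add("\\d").endQty()
            .add("-")
            .beginQty("{2,3}").add("ab").add("\\d").endQty()
            .build();
        System.out.println(p);  // \d+-(?:ab\d){2,3}
    }
}
```

A linear StringBuilder cannot make that decision, because by the time the quantifier is known the text before it has already been written.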
Also, the class aims to encapsulate the complete Java RegEx with counted positive lookbehind groups, Unicode character classes, range intersection and so on.
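
For readers who have not met these Java RegEx features, the plain-Java sketch below shows what the raw syntax looks like. It does not use the class, and the reading of "counted lookbehind" as a lookbehind with a bounded quantifier is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class JavaRegexFeatures {
    /** Returns the first match of regex in text, or null if there is none. */
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Range intersection: lowercase letters that are not vowels.
        System.out.println(firstMatch("[a-z&&[^aeiou]]+", "strength"));          // str

        // Unicode character classes: an uppercase letter followed by lowercase letters.
        System.out.println(firstMatch("\\p{Lu}\\p{Ll}+", "visit Z\u00fcrich"));  // Zürich

        // Positive lookbehind with a bounded count: "px" preceded by 1 to 3 digits.
        System.out.println(firstMatch("(?<=\\d{1,3})px", "width 120px"));        // px
        System.out.println(firstMatch("(?<=\\d{1,3})px", "px"));                 // null
    }
}
```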

An added feature is that any element preceded by Named(Name) can be referred to and copied, and named groups can be used in the ReplaceWith template.
The B4X Matcher does not use named groups, but it does support numbered groups. So the code:

Named group:
Dim r As RegexConstructor

r.Named("Part 1").GrpBegin

'...


r.TempAdd("[").TempAddGrp("Part 1")

will find the group number of "Part 1" and replace text with the group's contents.
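
The name-to-number lookup can be pictured roughly like this. It is a Java sketch under stated assumptions, not the code of the class: the registry, the method names and the restriction that group bodies contain no capturing groups of their own are all made up for illustration. Note that a name like "Part 1" could not be a native Java named group anyway, since Java group names may only contain letters and digits.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

final class NamedToNumbered {
    // Hypothetical registry: free-form display name -> capturing-group number.
    private final Map<String, Integer> groupNumbers = new LinkedHashMap<>();
    private final StringBuilder pattern = new StringBuilder();
    private int groupCount = 0;

    /** Adds a capturing group; the body must not contain capturing groups itself. */
    NamedToNumbered group(String name, String body) {
        groupCount++;
        groupNumbers.put(name, groupCount);
        pattern.append('(').append(body).append(')');
        return this;
    }

    /** Adds text that is matched literally. */
    NamedToNumbered literal(String text) {
        pattern.append(Pattern.quote(text));
        return this;
    }

    /** Turns a display name into a numbered reference for a replacement template. */
    String templateRef(String name) {
        Integer number = groupNumbers.get(name);
        if (number == null) throw new IllegalArgumentException("Unknown group: " + name);
        return "$" + number;
    }

    String pattern() { return pattern.toString(); }

    public static void main(String[] args) {
        NamedToNumbered r = new NamedToNumbered()
            .group("Part 1", "\\w+")
            .literal("=")
            .group("Part 2", "\\d+");

        // Mirrors the idea of TempAdd("[") followed by TempAddGrp("Part 1").
        String template = "[" + r.templateRef("Part 1");
        System.out.println(template);                                   // [$1
        System.out.println("size=42".replaceAll(r.pattern(), template)); // [size
    }
}
```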

The class has a Visualize function. Pass a TreeView (in B4J) to it and it will populate the TreeView with its elements for debugging.

This class is hot off the press, has not been tested very much, and may change to make it simpler and faster to use. I deliberately kept RegEx terms like lazy, greedy, eager, etc. out of the class because they confused me when I sat down to write it (a greedy search may find nothing, for example).

Instead of separate "ZeroOrMore", "OneOrMore", etc. quantifiers, counting is handled in a uniform way by passing a Minimum value, which is the "at least" part and can be zero, and a Maximum value which, if it is -1 (or the constant QtyAny), means "any number of".
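
As a rough picture of what such a uniform Minimum/Maximum pair has to turn into, here is a Java sketch that maps the two numbers to a RegEx quantifier suffix. The function and constant names are hypothetical, and whether the class prefers the short forms (?, *, +) over braces is an assumption.

```java
final class QuantifierSuffix {
    /** Stand-in for the "any number of" marker; the value -1 follows the description above. */
    static final int QTY_ANY = -1;

    /** Maps a (min, max) pair to the quantifier text that follows an element. */
    static String suffix(int min, int max) {
        if (min < 0 || (max != QTY_ANY && max < min)) {
            throw new IllegalArgumentException("Invalid range: " + min + ".." + max);
        }
        if (min == 1 && max == 1) return "";        // exactly once: no quantifier needed
        if (min == 0 && max == 1) return "?";
        if (min == 0 && max == QTY_ANY) return "*";
        if (min == 1 && max == QTY_ANY) return "+";
        if (max == QTY_ANY) return "{" + min + ",}";
        if (min == max) return "{" + min + "}";
        return "{" + min + "," + max + "}";         // no space after the comma
    }

    public static void main(String[] args) {
        System.out.println(suffix(0, 1));        // ?
        System.out.println(suffix(0, QTY_ANY));  // *
        System.out.println(suffix(1, QTY_ANY));  // +
        System.out.println(suffix(2, QTY_ANY));  // {2,}
        System.out.println(suffix(3, 3));        // {3}
        System.out.println(suffix(1, 4));        // {1,4}
    }
}
```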
Greediness, laziness and possessiveness can be introduced by appending the functions:

Quantity Types:
r.BeginQty(0, 1).CutShort

'or

r.BeginQty(0, 1).Extend

'or

r.BeginQty(0, 1).BlockRetries

which produce ??, ? and ?+ in RegEx.
(.Extend is only relevant when you copy a whole Quantity construct and want to change its behaviour in the copy.)
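
What the three suffixes do differently can be seen in plain Java. This sketch only exercises the raw RegEx forms; the reading that CutShort, Extend and BlockRetries correspond to the lazy, greedy and possessive forms respectively is inferred from the description above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuantifierModes {
    /** Returns the first match of regex in text, or null if there is none. */
    static String first(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Greedy "?" takes the optional character if it can.
        System.out.println("[" + first("a?", "ab") + "]");   // [a]
        // Lazy "??" takes as little as possible, here nothing at all.
        System.out.println("[" + first("a??", "ab") + "]");  // []

        // Greedy gives the character back when the rest of the pattern needs it...
        System.out.println(Pattern.matches("a?a", "a"));     // true
        // ...possessive "?+" never gives it back, so the whole match fails.
        System.out.println(Pattern.matches("a?+a", "a"));    // false
    }
}
```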

Quantifiers are also defined at the beginning rather than at the end, but a QtyBegin....QtyEnd sequence will append the quantifier to the end of whatever it encloses. It will also wrap whatever is in between in brackets if necessary.

So, try it if you like. Let me know if I left bugs in it. Or if something is missing. Or if something is particularly annoying.

If people want it and it is polished up, it can be released as a library.

The B4J Demo program cannot produce or compile B4X commands, of course, so it is only meaningful if you get into the source of the B4XMainPage and manually modify the pattern.
 

Attachments

  • RegexConstructor Demo.zip (493 KB)
  • RegexConstructor.bas (55.6 KB)