Updated 2013-10-14 14:30:42 by pooryorick

Perl-Compatible Regular Expressions. A superset of Regular Expressions with a few extra features introduced by Perl, less a couple of features that could not be enforced without Perl itself. Welcomed by many, hated by many others.

http://www.pcre.org/

Note that, in spite of the name, PCRE is not confined to Perl. Other programs and languages, such as PHP, implement it as well.

Tcl uses ARE.

DKF - Wow. There's more Features from the Black Lagoon in there than you can shake a B-Movie at...

A big Regular Expression fan sees DKF's remark and says: Maybe Tcl uses ARE because Tcl is so merciful. More features for the bold and fewer features for the queasy.

Another (OK, the same) big Regular Expression fan also says: Regular Expressions are not very easy, granted, but they're also overly mystified, and PCRE take the blame for some extra mystification, even by those who are good friends with Regular Expressions.

The thing is that PCRE allow tricks that are impossible with traditional RE. Some people advocate avoiding PCRE completely and, instead, writing even more complex, long-winded, probably convoluted code to replace them.

Big Regular Expression Fan will never understand these people.

Here is a very quick summary of PCRE's most relevant features. Items marked with + are supported by ARE (thank God).
 + foo(?=bar)             match "foo" only if "bar" follows it
 + foo(?!bar)             match "foo" only if "bar" does NOT follow it
   (?<=foo)bar            match "bar" only if "foo" precedes it
   (?<!foo)bar            match "bar" only if "foo" does NOT precede it

   (?<!in|on|at)foo       match "foo" only if NOT preceded by "in", "on" or "at"
   (?<=\d{3})(?<!999)foo  match "foo" only if preceded by 3 digits other than "999"

 + (?i)abc                case-insensitive match of abc, ABC, aBc, ABc, etc.
   ab(?i)c                matches abc or abC in PCRE: (?i) applies from that point on. ARE accepts (?i) only at the start
   (ab(?i)c)              matches abc or abC; inside parens the option lasts only to the end of the group
 + (?m)                   multi-line pattern space: same as "s/FIND/REPL/M"
 + (?s)                   set "." to match newline also: same as "s/FIND/REPL/S"
 + (?x)                   ignore whitespace and #comments in the pattern
 + (?:abc)foo             match "abcfoo", but do not capture 'abc' in \1
 + (?:ab|cd)ef            match "abef" or "cdef"; neither 'ab' nor 'cd' is captured
 + (?#remark)xy           match "xy"; remarks after "#" in the parens are ignored.
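
The inline options in the list can be tried in any engine that shares the syntax. The sketch below assumes Python's re module as a stand-in: it is neither PCRE nor ARE, but it accepts the same (?i), (?s), (?m), (?x) and (?:...) forms when the options sit at the start of the pattern. The sample strings are made up for illustration.

```python
import re

# (?i) at the start makes the whole pattern case-insensitive.
case_blind = re.fullmatch(r"(?i)abc", "aBc")

# (?:...) groups without capturing, so the only numbered group is (ef).
groups = re.fullmatch(r"(?:ab|cd)(ef)", "cdef").groups()

# By default "." refuses a newline; (?s) lets it match one.
dot_default = re.fullmatch(r"a.b", "a\nb")
dot_all = re.fullmatch(r"(?s)a.b", "a\nb")

# (?m) lets ^ and $ match at the start and end of every line.
lines = re.findall(r"(?m)^\w+$", "one\ntwo")

# (?x) ignores whitespace and #comments inside the pattern.
verbose = re.fullmatch(r"(?x) a b c  # spaces and this comment are ignored", "abc")
```

Here groups holds only 'ef', which is the point of (?:...): the alternation is grouped but never numbered.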

 (?(condition)yes-pattern)
 (?(condition)yes-pattern|no-pattern)
 ...matches conditionally, like "if" statements.
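
A small sketch of the conditional form, assuming Python's re module as a stand-in (it supports conditions that test whether a numbered or named group took part in the match, which is only a subset of what PCRE allows). The pattern and inputs are invented for illustration: the closing ">" is required only if an opening "<" was matched.

```python
import re

# Group 1 is the optional "<". The conditional (?(1)>) demands ">"
# when group 1 matched, and demands nothing otherwise.
bracketed = re.compile(r"(<)?\w+(?(1)>)")

both = bracketed.fullmatch("<foo>")      # "<" seen, ">" required and present
neither = bracketed.fullmatch("foo")     # no "<", so no ">" required
open_only = bracketed.fullmatch("<foo")  # "<" seen but ">" missing
close_only = bracketed.fullmatch("foo>") # no "<", so the stray ">" is left over
```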

 (?R)                    recursive match. OK, this one is really tough.

 \l                make next letter lowercase
 \L                make letters lowercase until \E
 \u                make next letter uppercase
 \U                make letters uppercase until \E
 \Q                escape all metacharacters until \E
 \E                end of the modifier's action
 \G                anchor at the end of the previous match

Roy Terry, 22July2003: It seems that Tcl's (?=foo) and (?!foo) are equivalent to the PCRE (?<...) feature. No?

Big Regular Expression Fan, 22July2003: Not exactly.

  • "Look ahead assertions" - supported by ARE and PCRE:

foo(?=bar) will match "foo" only if "bar" follows it. For example:
 http://(?=www\.)

The RE above will only find Web addresses whose sub domain is 'www'.

foo(?!bar) will match "foo" only if "bar" does not follow it. So, conversely,
 http://(?!www\.)

...will only find Web addresses whose sub domain is NOT 'www'.
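
To see the two look-ahead forms side by side, here is a minimal sketch. It assumes Python's re module as a stand-in engine (the look-ahead syntax is the same as in ARE and PCRE), and the example.com addresses are made up.

```python
import re

urls = ["http://www.example.com", "http://ftp.example.com"]

# Positive look-ahead: "http://" must be followed by "www."
with_www = [u for u in urls if re.search(r"http://(?=www\.)", u)]

# Negative look-ahead: "http://" must NOT be followed by "www."
without_www = [u for u in urls if re.search(r"http://(?!www\.)", u)]

# The assertion consumes nothing: only "http://" is part of the match.
matched_text = re.search(r"http://(?=www\.)", urls[0]).group()
```

Note that matched_text is just "http://". The look-ahead checks what follows without including it in the match.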

  • "Look behind assertions" - supported by PCRE only:

(?<=foo)bar will match "bar" only if "foo" precedes it. For example:
 (?<=http://)ftp\.

The RE above will only find Web addresses whose sub domain is 'ftp', but their protocol actually is 'http'.

(?<!foo)bar will match "bar" only if "foo" does not precede it. Therefore, conversely,
 (?<!http://)ftp\.

...will only find Web addresses whose sub domain is 'ftp', expressly ruling out any whose protocol is 'http'.
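
A quick way to check the look-behind forms, sketched with Python's re module as an assumed stand-in (Tcl's ARE cannot run these, and Python is stricter than PCRE in that every look-behind must have a fixed width). The example.com addresses and the "123foo" strings are made up.

```python
import re

urls = ["http://ftp.example.com", "ftp://ftp.example.com"]

# Positive look-behind: "ftp." must be preceded by "http://"
http_ftp = [u for u in urls if re.search(r"(?<=http://)ftp\.", u)]

# Negative look-behind: "ftp." must NOT be preceded by "http://"
not_http_ftp = [u for u in urls if re.search(r"(?<!http://)ftp\.", u)]

# Two look-behinds stacked at one position: three digits, but not "999".
three_digits = re.search(r"(?<=\d{3})(?<!999)foo", "123foo")
nines = re.search(r"(?<=\d{3})(?<!999)foo", "999foo")
```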

Of course, in many cases they can be interchanged. You can say http://(?=ftp\.) instead of (?<=http://)ftp\.. But that may force you to look for what you really want "backwards", so to speak. I've actually encountered situations in which I could not do without "look behind assertions", but I cannot recall any of them right now. Stay tuned to this page. Breaking news at any moment.
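
One practical difference between the two spellings is which part of the string ends up as the match. A short sketch, assuming Python's re module as a stand-in and a made-up address:

```python
import re

address = "http://ftp.example.com"

# Look-ahead version: the match is the protocol part.
ahead = re.search(r"http://(?=ftp\.)", address)

# Look-behind version: the match is the sub domain part.
behind = re.search(r"(?<=http://)ftp\.", address)

ahead_text, ahead_span = ahead.group(), ahead.span()
behind_text, behind_span = behind.group(), behind.span()
```

Both patterns accept the same address, but ahead_text is "http://" while behind_text is "ftp.". That matters as soon as the match is used for substitution or extraction.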

Meanwhile, check out this beautifully formatted page: http://www.slabihoud.de/spampal/pcrepattern.html

DKF notes that look-behind assertions could probably be added to Tcl's RE package without stomping over the theoretical basis for it in any way worse than look-ahead assertions already do. But it would take some work from someone energetic...

Lars H, 2007-12-31: One thing I find a bit curious about the ARE lookaheads is that they are not limited to the subregexp in which they appear:
  ((?=foo)[a-z]+)o

matches "foo" with the capturing subexpression matching only "fo". This makes (?=) different from & of the grammar_fa regexps, and similarly (?!) different from what can be done with the grammar_fa !. I suspect the PCRE lookaheads are like the ARE ones, but you might want to check.
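
As a partial check, the same pattern can be run through a backtracking engine with PCRE-like look-aheads. The sketch below assumes Python's re module for that purpose. It says nothing definitive about PCRE itself, but it does show the look-ahead reading past the end of the group that contains it.

```python
import re

m = re.search(r"((?=foo)[a-z]+)o", "foo")

whole = m.group(0)   # text matched by the entire pattern
inner = m.group(1)   # text matched by the capturing subexpression

# The look-ahead needed to see "foo", yet the group that contains it
# ended up matching only "fo": the assertion looked beyond its own group.
```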

Info copied from Regular Expressions page:

Most common regular expression implementations (notably Perl and direct derivatives of the PCRE library) exhibit poor performance in certain pathological cases. Henry Spencer's complete reimplementation as a "hybrid" engine appears to address some of those problems. See [1] for some fascinating benchmarks.

Lars H: One point here, if I read the PCRE manpages correctly, is that the "alternative" matching algorithm of PCRE (FA-based, hence linear time) cannot do capturing parentheses. The paper quoted above does however contain a remark that it is perfectly possible for FA-based RE engines to do capturing parentheses.

Note on theoretical background: The classical approach to regular expressions, by way of finite automata (FA), is about checking whether a string (or "word", as it tends to be called in the theoretical literature) in its entirety matches a regular expression (like including ^ and $ in every regexp). It is fairly easy to modify the setting so that one gets a "searching" regexp instead (add .* at the beginning, stop when reaching an accepting state), but finding what subexpressions were matched is nontrivial.
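
The difference between whole-word matching and searching can be illustrated in a few lines. This is only a sketch using Python's re module (a backtracking engine, not an FA-based one), with a made-up input; it shows the three settings described above, not how an automaton would implement them.

```python
import re

word = "abba"

# Classical setting: the whole word must match the expression.
whole_word = re.fullmatch(r"b+", word)

# Searching setting: the expression may match anywhere inside the word.
found = re.search(r"b+", word).group()

# Searching expressed as whole-word matching by padding with ".*".
# Recovering the submatch then needs a capturing group, which is
# the nontrivial part for a pure FA.
padded = re.fullmatch(r".*?(b+).*", word).group(1)
```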

More features from PCRE...

  • Subroutines - This, along with recursion/conditionals listed above, is a key feature that escaped from Area 51 and is terrorizing the citizens of Middle America.
  • Named RE components - pure syntactic sugar.
  • Matching of bytes of UTF8 with \C - utterly contrary to the way Tcl handles strings. We match against characters, not bytes, so if you wanted to work at the byte level you'd need to use the encoding command first.
  • Possessive quantifiers - these are a guru feature anyway.
  • Callouts - See what I say about Subroutines...

As a side note on recursive matching, that alters the language from being expressible with a Finite Automaton to requiring a full Turing Machine, and hence not an RE language any more but a full programming language with a horrible syntax. If you want that sort of power, use Tcl for real. :^D

Lars H: You're sure it doesn't just change the language into context-free or context-sensitive [2]? Not that a linear-bounded non-deterministic Turing machine should be much better than the full thing, though.