Drawbacks of Tcl's Regexps

This page serves as a list of current (June 2003) problems with Tcl's regexp library.

Bugs in implementation of TCL_REG_CANMATCH flag, which mean the library believes '.x' could match 'abc' at position zero (which is plain wrong!). The correct answer is that it could match at position 2, if more text were available.
The implementation is completely incomprehensible and thereby totally at odds with the rest of Tcl's wonderfully readable source code. Even trying to work out what a function like 'find' or 'cfind' is supposed to do (let alone what it actually does) is practically impossible.
No backwards matching support. The only way to find matches backwards is to iteratively search forwards until you find the last match.
Tcl doesn't expose any of the low-level regexp functions in its stubs table. This means any Tcl extension which wants to run a regexp over a buffer must first place that buffer into a Tcl_Obj (with Tcl_NewUnicodeObj, most likely). This is not a satisfactory solution if the buffer is large and already in Unicode format, since placing it in a Tcl_Obj will result in it being copied. Tcl should therefore expose RegExpExecUniChar (from tclRegexp.c) in its stubs table.
The library has very good properties for pure-greedy or pure-non-greedy matching (not shared by Perl, for example, which will not always return the longest/shortest match respectively), but the rules for mixed-greediness matching are very, very unclear.
"Can't Happen" bugs, e.g. [1]

So, what proposals have been submitted as TIPs to address the deficits?

Related question: who understands the code enough to be persuaded to write a TIP to address any of these items. (Since, without an implementor, any TIP would only languish)

[Compliments to Vince for this summary.]

DL: I would restate the Unicode issue slightly differently. I don't think that it is any big deal that objects have to be placed into a Tcl_Obj. First because this is pretty typical throughout Tcl and second, because you can still manipulate the data in the object directly. So the overhead of setting up a Tcl_Obj ONCE is not really that big a deal. If you compare this to some other things in Tcl that also only take Tcl_Objs, regexp looks like a pretty good example for the interface. Expect, for example, lets the regexp engine run over the data as it streams and the overhead is not a problem as the Tcl_Obj creation is only done once.

That said, a real negative that I would like to bring out is that Regexp engine does not support UTF8 but requires Unicode! This can be painful because Unicode is not necessarily a sensible choice for some apps. So this choice causes shimmering in these apps. I can understand why Henry choose to support only Unicode - offsets simply aren't efficient in UTF8. I don't know what a better solution is - just because I have a complaint with the existing implementation doesn't mean it isn't the best solution :P but I'm adding it here for the record anyway.

hans This is especially noticeable when doing lsearch -regexp on a large list with string elements; a unicode representation of every list item is created, and memory cost explodes. lsearch -glob does not create any unicode strings, memory usage remains constant. The only other way then is carefully using the regexp command in a foreach loop, which is slower.

DKF: On the last point (greedy/non-greedy) the rules are clear in a certain sense. That sense is that the first quantifier syntactically encountered in a RE determines whether the RE as a whole is greedy or not, and all quantifiers in the RE are either greedy or not greedy. There are no mixed-greediness REs, which is definitely unlike Perl. On the other hand, I've no idea what the automata-theoretic basis for a mixed-greediness RE is; the Perl code is widely accused (probably rightly) of being a horrible hack.

RJM inserts here that even non-mixed RE's may impose the forcing of greedy for non-greedy expressions. Let me use an example from Brent Welch's book, chapter 11, "Back References". The expression

 ('|").*?\1

definitely evaluates greedy. Due to DKF's explanation, I tried to add a "dummy" non-greedy expression in front of this expression, which may serve as a workaround where a string does not start with the search phrase, and where the naked result without quotes is desired:

 _*?.('|")(.*?)\1

(end of insertion)

JE: Don't expect an automata-theoretic basis for mixed-greediness regexps. A finite state automaton either reaches an accepting state or it doesn't; there is no record of how the automaton got there. So if you're concerned about which subexpressions of an RE matched what substrings of the input, automata theory is no help.

(With a DFA, it's easy to find the longest or shortest match for a regular expression as a whole -- halt as soon as you reach a final state for the shortest match, halt when you reach the dead state for the longest one -- but when it comes to mixed-greediness regexps, a different theoretical basis is needed. I don't know what that theoretical basis might be; if anyone knows of one I'd be very interested to hear about it.)

Lars H, 2008-03-04: An automata-theoretical foundation for regexps with submatch capturing, being capable of handling mixed greediness, can be found in [2].

The model is that every terminal (ordinary character or bracket expression) in the regular expression is assigned a unique priority, and that highest priority wins when there is a choice of how to match. The way these priorities would be assigned for our regular expressions is that priority is highest for the leftmost terminal and then decreases — thus the left branch in a choice is preferred over the right branch and the left factor in a juxtaposition expands at the expense of the right factor. What non-greediness in a quantifier seems to do is that they change this left-to-right rule for juxtapositioning, so that all terminals in the left factor get lower priority than those in the right factor.

Example: In the RE

  (a+)(a+)

the left a has higher priority than the right, but in the RE

  (a+?)(a+)

it has lower priority, because it's inside a non-greedy sub-RE. If one was to nest things, like

  ((a+)(a+)+?)(a+)

then the priorities of the as, from left to right, would be 1, 0, and 2 respectively (2 being the highest). Now, this may not be how Tcl treats this RE, but it is a reasonable interpretation of it.

Another thing from that paper is that the model doesn't have capturing parentheses as such, but rather "tag" terminals which match the empty string (like ^ and $) and have their position remembered. Making capturing parentheses from tags is trivial (put one tag before and another after), as is probably the other way around (put an empty capturing parenthesis at the position you want to tag), but choosing capturing parentheses as the fundamental concept seems rather much a heritage from the tradition of using backtracking RE engines.

Back to DKF:

I should note here that there is an exception to the rule on quantifier mixing, and that is for (non-zero-width) lookahead assertions, which are always non-greedy. But they're handled effectively by recursively calling the RE engine internally, and the boundary between which automata are being used is fairly obvious. They're also not very efficient, and you can't use them for subexpression capture.

NL There was a lengthy discussion [3] back in 2003 about the non-greedy/greedy mix. It seems that this a well known issue with Spencer's original (and Tcl) version. I doubt anyone would replace the Spencer library that was re-written for Tcl [4], and by the look of the code, I doubt anyone other then Mr. Spencer will be able to change/fix it.

Vince notes that DKF sounds suspiciously as if he understands the code and could therefore be someone to address some of these bugs...

Vince: Unfortunately, it's very hard to submit a TIP to address any of the above issues (except the addition of a function to the stubs table, which I could happily submit a TIP for), because the code is so incomprehensible that I don't know who (beyond Henry Spencer) would actually be in a position to implement any changes. That's partly why I put together the above list, to see whether discussion here would point to a way forward/out of this mess ;-).

Joe Mistachkin: After analyzing and attempting to isolate some of the existing bugs in the current regexp engine for countless hours, I agree with Vince on all points.

DKF: Me too. While I understand a lot of the theoretical basis for RE matching (producing a non-deterministic automata from an RE isn't exactly hard compared with most compilation tasks) Henry's code is not at all for the faint-hearted. It's not designed to be read by anyone other than the author, given the number of single-character variable names and interesting C-preprocessor tricks. If I had a few months uninterrupted effort available, I'm sure I could grasp what's going on, but I've not had that much totally free time since 1994...

Vince: Sounds like Henry's the only person who'll be able to help us, then. How about we collect together all of the known problems on this page (Joe alludes to some bugs, but I'm not sure what they are), and then see if someone on the TCT can get in touch with Henry...

One alternative for backward matching: reverse the string you're matching against, and construct a regular expression designed to work on a reversed string.

One alternative for replacing the existing system: implement a regexp engine in pure tcl (with someone with a good grounding in the theory -- perhaps a college professor who teaches theory of computation -- providing a strong influence) then start replacing the most inefficient pieces with compiled code. Perhaps implement several engines, for perspective (for example, both a NFA and a DFA).

Note that perl regular expressions provide a superset of regular expression functionality. Translated out of perlspeak, this means that when some of the notation is used the expressions are no longer regular. I'm not even certain they're context free. It might very well be that the replacement system doesn't have quite the same semantics as the current system.

CMcC: Looking at rxspencer-alpha3.8.g3.tar.gz here [5] I see a note from (I presume) Henry Spencer:

    New in alpha3.5:  Active development of this code has been stopped
    -- I'm working on a complete reimplementation

I'm guessing that what we have in tcl is that reimplementation, since the code is very different.

LV From personal experience with a variety of software systems that I've inherited over the past 30+ years, may I suggest that having an important chunk of code that is only comprehensible to one person (aka the golden child) is not the way that a system should remain. A variety of outcomes fall out from this situation.

Nothing ever has to change in the code and everything is golden for eternity. This is the state that most management hope is the case.
Some improvements or fixes are needed, and the golden child has to be convinced/bribed/blackmailed/coerced into making them.
Some major show stopper issue arises and the functionality has to be removed.
Some major show stopper issue arises and everyone just decides to live with the problem.
Programmers with varying levels of confidence go in and make fixes/changes, causing unexpected side effects.
A new golden child is found/hired/coerced into taking over the code.
The code is rewritten so as to be more comprehensible to the general maintenance group.

I figure I missed some, but I've lived through most of these... :-( ...

RLE (2012-03-11) A thread in the pgsql-hackers mailing list started by Tom Lane on Feb 18 discusses the PostgreSQL communities apparent decision to possibly begin making some changes to the TCL regular expression engine. PostgreSQL uses TCL's regular expression engine. See http://archives.postgresql.org/pgsql-hackers/2012-02/msg00782.php

. It might be worthwhile for the TCL core team to keep an eye on what the PostgreSQL community achieves, because their changes might be beneficial to TCL as well.

Additional threads regarding the TCL/PostgreSQL regex engine:

http://archives.postgresql.org/pgsql-hackers/2012-03/msg00141.php

This may be relevant https://groups.google.com/d/msg/comp.lang.tcl/FddeFPbTFw8/UA3RwHwxk8QJ

Category Internals