Updated 2017-10-31 20:42:49 by resuna

regexp, a built-in Tcl command, matches a regular expression in a string.

Synopsis

regexp ?switches? expr string ?matchVar? ?subMatchVar subMatchVar ...?

Documentation

official reference
regular expression syntax

See Also

Regular Expression
About regular expressions in general.
Regular Expressions
About Tcl regular expressions.
regsub
A companion command to [regexp] that, in addition to matching, performs substitutions.
string match
switch
-indexvar and -matchvar can be used to capture both matched substrings and their indices in one operation.
glob

Bugs

Mismatch between regexp -indices and switch -regexp -indexvar
Fixed in Tcl version 8.5.10.

Resources

Visual REGEXP, by [Laurent Riesterer]
a graphical utility written in Tcl to illustrate regexp operation.

Description

Determines whether the regular expression expr matches part or all of string. Returns 1 if it does, 0 if it doesn't.

Any additional arguments specified after string are the names of variables that receive information about the match. matchVar is set to the range of string that matched all of expr. The first subMatchVar receives the characters that matched the first (leftmost) parenthesized subexpression within expr, the next subMatchVar receives the characters that matched the next parenthesized subexpression to the right in expr, and so on.
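
As a minimal sketch of how the match variables are filled in (the sample line and the variable names whole, year, month and day are only illustrations):

```tcl
# Sample input; the date layout is only an illustration.
set line {Updated 2017-10-31 by resuna}

# whole receives the full match; year, month and day receive the
# three parenthesized subexpressions, from left to right.
if {[regexp {(\d+)-(\d+)-(\d+)} $line whole year month day]} {
    puts "whole=$whole year=$year month=$month day=$day"
}
# prints: whole=2017-10-31 year=2017 month=10 day=31
```

The name given for matchVar is an ordinary variable name. Examples further down this page use -> in that position simply as a throwaway variable that reads nicely.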

If the initial arguments to [regexp] start with -, they are treated as switches. The following switches are currently supported:
-all
Matches the expression multiple times and returns the total number of matches found. Any specific match variables contain only the last corresponding match.
-about
Instead of attempting to match the expression, returns a list containing information about the expression. The first element of the list is a subexpression count. The second element is a list of property names that describe various attributes of the expression. This switch is primarily intended for debugging purposes (see Descriptive Flags below).
-expanded
Enables use of the expanded expression syntax where whitespace and comments are ignored. This is the same as specifying the (?x) embedded option (see METASYNTAX, below).
-indices
Changes what is stored in matchVar and the subMatchVars. Instead of storing the matching characters from string, each variable will contain a list of two integers giving the indices in string of the first and last characters in the matching range of characters. To obtain both matches and indices, use switch with both -indexvar and -matchvar.
-inline
Instead of placing matches in variables, returns a list of matches. Together with -all, matches the expression repeatedly, each time appending the match and any submatches to the list as separate elements, and returns that list. With -inline, match variables may not be specified.
-line
Enables newline-sensitive matching. By default, newline is a completely ordinary character with no special meaning. With this flag, ‘[^’ bracket expressions and ‘.’ never match newline, ‘^’ matches an empty string after any newline in addition to its normal function, and ‘$’ matches an empty string before any newline in addition to its normal function. This flag is equivalent to specifying both -linestop and -lineanchor, or the (?n) embedded option (see METASYNTAX, below).
-linestop
Changes the behavior of ‘[^’ bracket expressions and ‘.’ so that they stop at newlines. This is the same as specifying the (?p) embedded option (see METASYNTAX, below).
-lineanchor
Changes the behavior of ‘^’ and ‘$’ (the “anchors”) so they match the beginning and end of a line respectively. This is the same as specifying the (?w) embedded option (see METASYNTAX, below).
-nocase
Causes upper-case characters in string to be treated as lower case during the matching process.
-start index
Specifies a character index offset into the string to start matching the expression at. When using this switch, ‘^’ will not match the beginning of the line, and \A will still match the start of the string at index. If -indices is specified, the indices will be indexed starting from the absolute beginning of the input string. index will be constrained to the bounds of the input string.
--
Marks the end of switches. The argument following this one will be treated as expr even if it starts with a -.

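When the match succeeds, every subMatchVar that has no corresponding match (because there are more variables than parenthesized subexpressions, or because a subexpression took no part in the match) is set to the empty string, or, if -indices was specified, to -1 -1. When the match fails, the variables are left alone, so existing variables keep their previous values.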
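
The switches described above can be tried out with a short script. This is only a sketch: the sample text and all variable names are made up for illustration.

```tcl
set text "Foo bar foo BAZ"

# -all with -nocase: count every match, ignoring case.
set count [regexp -all -nocase {foo} $text]          ;# 2

# -inline: return the match and submatches as a list instead of
# setting variables.
set parts [regexp -inline {(\w+) (\w+)} $text]       ;# {Foo bar} Foo bar

# -all -inline: one flat list containing every match.
set words [regexp -all -inline {\w+} $text]          ;# Foo bar foo BAZ

# -indices: store first/last character indices instead of text.
regexp -indices {bar} $text span                     ;# span is {4 6}

# -start: begin searching at index 4; the reported indices are still
# relative to the beginning of the whole string.
regexp -start 4 -indices {o+} $text later            ;# later is {9 10}

# -line: ^ and $ also match at line boundaries.
set lines [regexp -all -line {^\w+$} "one\ntwo words\nthree"]   ;# 2

# --: needed here because the pattern itself starts with a dash.
set dash [regexp -- {-x} "a-x"]                      ;# 1
```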

Example

puts {enter string:}
set input [read stdin]

if {[regexp {^abc} $input]} {
  puts yes
} else {
  puts no
}

Gotcha: subMatchVar Variables Don't Get Changed

Consider this example:
foreach item $list {
        regexp {(some) (expression)} $item -> var1 var2
        lappend result $var2
}

If the expression doesn't match, a value from a previous iteration could be left over in $var2, and $result could get corrupted. Better to do a little checking:
foreach item $list {
        if {[regexp {(some) (expression)} $item -> var1 var2]} {
                lappend result $var2
        } else {
                ...
        }
}
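
A runnable illustration of the difference, using a made-up two-element list (the names items, unchecked and checked are not part of the original example):

```tcl
set items [list {some expression} {no match here}]

# Unchecked: the second item does not match, so var2 still holds
# the value captured from the first item.
set unchecked {}
foreach item $items {
    regexp {(some) (expression)} $item -> var1 var2
    lappend unchecked $var2
}

# Checked: only use var2 when regexp reports a match.
set checked {}
foreach item $items {
    if {[regexp {(some) (expression)} $item -> var1 var2]} {
        lappend checked $var2
    }
}

puts "unchecked: $unchecked"   ;# unchecked: expression expression
puts "checked: $checked"       ;# checked: expression
```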

Descriptive Flags

More info about the return values from -about, written by DKF in Feb, 2007 (with further additions and clarifications by DKF from a bit later in italics):

"[These flags] currently only exist for testing purposes. Going through the definitive list, I see:
REG_UBACKREF
Indicates that the RE contains backreferences, which forces a more expensive evaluation engine. (Note that this implies that there must be capturing parens, but there is no flag to indicate that.) A simple RE that has this flag set: (.)\1
REG_ULOOKAHEAD
Indicates that the RE contains lookahead constraints. A simple RE that has this flag set: foo(?=bar)
REG_UBOUNDS
Indicates that the RE contains bounded matches (i.e. counted ranges expressed in the form {m,n}). A simple RE that has this flag set: [a-c]{3,5}
REG_UBRACES
Indicates that the RE contains braces that are not bounds. A simple RE that has this flag set: a{}b
REG_UBSALNUM
Indicates that there's a would-be rich backslash-alphanumeric sequence. Only happens when switched to parsing non-advanced REs. A fairly-simple RE that has this flag set: (?e)\a
REG_UPBOTCH
Indicates an unbalanced close-parenthesis ("specification botch" according to a comment in the source!) A fairly-simple RE that has this flag set: (?e))
REG_UBBS
Indicates that there is a backslash inside a bracketed character set. A simple RE that has this flag set: [\w'?!.]
REG_UNONPOSIX
Indicates that the RE is not a POSIX RE. Happens a lot! The POSIX RE spec is restrictive.
REG_UUNSPEC
Indicates that the RE is asking for unspecified behaviour? It's not clear what "unspecified" really means here.
REG_UUNPORT
Indicates that the RE is unportable? Portable to what? I don't know.
REG_ULOCALE
Indicates that the RE is (potentially) dependent on the locale. Many sets of characters theoretically depend on the locale, but Tcl only actually has a single locale for REs so this is really a pointless RE gnosticism.
REG_UEMPTYMATCH
Indicates that the empty string is matched by the RE. A simple RE that has this flag set: .*
REG_UIMPOSSIBLE
Indicates that the RE cannot possibly match anything. (Not all "impossible" REs are detected though.) A fairly-simple RE that has this flag set: foo\m
REG_USHORTEST
Indicates that the RE is non-greedy, and so uses a different matching engine. A simple RE that has this flag set: a.??b

If you're not an RE wonk or matcher, I'd assert that virtually all of these are totally uninteresting. :-) The backrefs, lookahead and bounds are probably most interesting from a "describing what's in there" POV."

I can't see any value in UNONPOSIX, UUNSPEC, UUNPORT or ULOCALE; they just don't seem to correspond to any question I might ever wish to ask about a regular expression. UBSALNUM and UPBOTCH are very low-value too, as they only apply when you move the RE engine into a non-standard mode.
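
To see the flags for oneself, the sample REs quoted above can be fed to -about. This is just a sketch; the exact set of flags reported alongside the ones named in the list may vary, so no output is shown.

```tcl
# Three of the sample REs from the list above.
foreach re [list {(.)\1} {foo(?=bar)} {[a-c]{3,5}}] {
    # -about returns a two-element list: the subexpression count
    # and the list of descriptive flags.
    lassign [regexp -about $re] count flags
    puts "${re}: $count subexpression(s), flags: $flags"
}
```

lassign requires Tcl 8.5 or later; on older versions, lindex on the result does the same job.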

Counting occurrences

Saravanan: Can anyone tell me how to retrieve the count of a particular character in a given string (using regexp only)? E.g.: set a "hithisisisis". I need to find how many occurrences of 'i' there are in $a.

Lars H: Use the -all option:
% regexp -all i $a
5
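
If the character to count might itself be a regular expression metacharacter (a dot, say), one possible approach is the ***= prefix, which makes the rest of the pattern a literal string. A small sketch, where countLiteral is a hypothetical helper name and not a built-in:

```tcl
# countLiteral is a hypothetical helper, not part of Tcl.
proc countLiteral {str sub} {
    # ***= tells the RE engine to treat the rest of the pattern
    # as a literal string rather than as a regular expression.
    return [regexp -all -- "***=$sub" $str]
}

puts [countLiteral hithisisisis i]   ;# 5
puts [countLiteral a.b.c .]          ;# 2
```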

How to use quotes and variables in regular expressions?

Feb 9th 2007 CJL wondered on Ask#5 what the correct/best/proper way of writing a regexp with quotes and the current value of a variable in the expression was? I want to match various patterns of the form <INPUT TYPE="TEXT" NAME="$something" SIZE="\d+" MAXLENGTH="\d" VALUE="\S+">, where $something has a range of values that is a subset of all possible values, i.e. I don't want to put \S+ in place of $something as that will give unwanted matches. Note the presence of quotes and escapes to complicate things.

MG Using format is probably one of the simplest.
set something "foobar"
set pattern {<INPUT TYPE="TEXT" NAME="%s" SIZE="\d+" MAXLENGTH="\d" VALUE="\S+">}
set pattern [format $pattern $something]

Assuming, of course, you don't have %-'s in your string. Otherwise, building it in steps may be easiest:
set something "foobar"
set pattern {<INPUT TYPE="TEXT" NAME="}
append pattern $something
append pattern {" SIZE="\d+" MAXLENGTH="\d" VALUE="\S+">}

LV I suspect the OP will need to replace those \d with %d and the \S with %s.
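
One further point, offered as a sketch rather than as part of the answers above: if the value of $something can contain characters that are special in regular expressions, and it is meant to match literally, it can be escaped before being inserted into the pattern. The sample value and the variable names escaped and html are assumptions made for illustration.

```tcl
# A value containing characters that are special in regular expressions.
set something {foo.bar[1]}

# Put a backslash in front of every non-word character so that it
# matches itself literally.
set escaped [regsub -all {\W} $something {\\&}]     ;# foo\.bar\[1\]

set pattern {<INPUT TYPE="TEXT" NAME="%s" SIZE="\d+" MAXLENGTH="\d" VALUE="\S+">}
set pattern [format $pattern $escaped]

set html {<INPUT TYPE="TEXT" NAME="foo.bar[1]" SIZE="20" MAXLENGTH="8" VALUE="abc">}
puts [regexp -- $pattern $html]                      ;# 1
```

Without the escaping step, the dot would match any character and [1] would be read as a bracket expression, so the pattern would accept names other than the intended one.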

Regular Expressions Caching

In the core, the compiled version of the RE is stored in the Tcl_Obj holding the string passed to it, alongside its string representation. Tcl also dynamically caches up to 30 compiled regular expressions.

Since the compiled code is stored in the input Tcl_Obj, you can easily cause a larger number of REs to be cached by (for example) assigning them to variables: if a regular expression is assigned to a variable and the variable is not changed, the Tcl core will save the compiled version of the RE and use the precompiled version of the object during the next evaluation.

To find #pragma <something> statements define a pattern like
set re {^\s*#\s*pragma\s+(.*)}
if { [regexp $re $line -> rest] } {
    ...
}

The above example will cause the compiled regular expression to be stored in the value of $re.

From "possible to precompile regexps", comp.lang.tcl, 2004-11-04.

The run-time benefit of regular expression caching can easily be shown:
# Run N different regexp patterns
proc test_regexps N {
    for {set i 0} {$i < $N} {incr i} {
        regexp "foobar$i" "foobar1"
    }
}
puts "29 Took: [time { test_regexps 29 } 100]"
puts "30 Took: [time { test_regexps 30 } 100]"
puts "31 Took: [time { test_regexps 31 } 100]"
puts "32 Took: [time { test_regexps 32 } 100]"

One run of this gave:
29 Took: 298 microseconds per iteration
30 Took: 372 microseconds per iteration
31 Took: 2000 microseconds per iteration
32 Took: 2107 microseconds per iteration

...clearly showing the extra cost of having to recompile each regexp pattern each time through, once the number of distinct patterns exceeds NUM_REGEXPS (30).

Using Regular Expressions to Strip Visually Blank Lines

DKF writes that it is hard to do this with any single RE on its own, though you can do it quite easily using a couple of things coupled together. This example uses regsub to strip the problematic lines, but cannot completely get rid of leading and trailing newlines without the extra string trim:
string trim [regsub -all {\n(?:\s*\n)+} $data \n] \n

However, I prefer selecting things positively, leading to a solution using regexp and join:
join [regexp -all -inline {(?=[^\n]*\S)[^\n]+} $data] \n

DKF 2006-08-10: More experimentation indicates that a single [regsub] can do the whole job:
regsub -all {^\n+|\n+$|(\n)+} $data {\1}

Note that the order of the alternatives is important!
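
As a quick check of the first two approaches (regsub with string trim, and regexp with join), here is a sketch that runs them on made-up sample data containing leading blank lines, a whitespace-only line and trailing blank lines:

```tcl
# Sample data: two leading blank lines, a line holding only spaces,
# and two trailing newlines.
set data "\n\nfoo\n  \n\nbar\n\n"

# regsub collapses each run of blank lines into a single newline;
# string trim then removes the newline left at each end.
set a [string trim [regsub -all {\n(?:\s*\n)+} $data \n] \n]

# regexp selects every line that contains a non-space character;
# join puts the selected lines back together.
set b [join [regexp -all -inline {(?=[^\n]*\S)[^\n]+} $data] \n]

puts $a
puts $b
```

On this input, both a and b come out as foo and bar on two consecutive lines.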